Abstract
Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.
Year: 2012 PMID: 22638577 PMCID: PMC3458527 DOI: 10.1093/nar/gks461
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Histograms of P-values from different gene set tests in the absence of any true differential expression, but with a small inter-gene correlation in the test set. The simulation setup and order of test methods is as for Table 1. Test methods are (A) geneSetTest (modt), (B) geneSetTest (ranks of modt), (C) sigPathway, (D) PAGE, (E) CAMERA (modt) and (F) CAMERA (ranks of modt). Existing methods A–D give results highly skewed towards small and falsely significant P-values, whereas CAMERA gives uniformly distributed values.
Type I error rates of gene set tests when genes in the set are correlated
| Test method | Nominal 0.01 | Nominal 0.02 | Nominal 0.05 | Nominal 0.10 |
|---|---|---|---|---|
| geneSetTest (modt) | 0.2779 | 0.3275 | 0.4157 | 0.4950 |
| geneSetTest (ranks of modt) | 0.2826 | 0.3319 | 0.4144 | 0.4955 |
| sigPathway (t) | 0.2524 | 0.3025 | 0.3880 | 0.4704 |
| PAGE (logFC) | 0.2441 | 0.2900 | 0.3709 | 0.4503 |
| CAMERA (modt) | 0.0087 | 0.0187 | 0.0477 | 0.0990 |
| CAMERA (ranks of modt) | 0.0086 | 0.0173 | 0.0473 | 0.1003 |
CAMERA holds its size correctly whereas existing methods are highly liberal.
Entries are probabilities of rejecting the null hypothesis when conducting a gene set test to compare two groups of four arrays. Set size is 100 with inter-gene correlation 0.05. The remainder of 10 000 genes are uncorrelated. Results are based on 10 000 simulated data sets, so the standard error with which the error rate is estimated ranges from slightly < 0.001 (for rates near 0.01) to slightly < 0.005 (for rates near 0.5).
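The simulation standard errors quoted in this footnote follow from the usual binomial formula for an estimated proportion. The snippet below is a minimal sketch of that arithmetic; the function name `binomial_se` is illustrative and not part of any CAMERA software.

```python
import math


def binomial_se(rate: float, n_sims: int) -> float:
    """Standard error of a rejection rate estimated from n_sims independent simulations."""
    return math.sqrt(rate * (1.0 - rate) / n_sims)


N_SIMS = 10_000  # number of simulated data sets reported for this table

# A rate near the smallest nominal level, and the largest rate in the table.
se_low = binomial_se(0.01, N_SIMS)
se_high = binomial_se(0.4955, N_SIMS)

print(f"SE at rate 0.01:   {se_low:.6f}")
print(f"SE at rate 0.4955: {se_high:.6f}")
```

Both values fall just under the bounds stated above, so differences between the CAMERA rows and the nominal levels are within a few simulation standard errors.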
Correlation estimates are more precise than implied by the nominal chi-square approximation
| Correlation | Mean estimate | Empirical SD | Theoretical SD |
|---|---|---|---|
| 0 | −0.00007 | 0.00688 | 0.00698 |
| 0.02 | 0.0196 | 0.0117 | 0.0124 |
| 0.05 | 0.0490 | 0.0190 | 0.0206 |
| 0.1 | 0.0981 | 0.0300 | 0.0342 |
| 0.2 | 0.1961 | 0.0481 | 0.0614 |
Columns 2 and 3 give the mean and standard deviation (SD) of correlation estimates over 10 000 simulated data sets with set size of m = 40 and residual df d = 27. The empirical SDs are consistently less than the theoretical values. The simulation standard error with which the empirical SD is estimated is about 1.4%.
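To make the comparison concrete, the ratio of empirical to theoretical SD can be computed directly from the table rows. This is only a sketch of the arithmetic on the published values; the variable names are illustrative.

```python
# (true correlation, empirical SD, theoretical SD), copied from the table above
rows = [
    (0.00, 0.00688, 0.00698),
    (0.02, 0.0117, 0.0124),
    (0.05, 0.0190, 0.0206),
    (0.10, 0.0300, 0.0342),
    (0.20, 0.0481, 0.0614),
]

# Ratio below 1 means the estimator is more precise than the nominal approximation implies.
ratios = [empirical / theoretical for _, empirical, theoretical in rows]

for (correlation, _, _), ratio in zip(rows, ratios):
    print(f"correlation {correlation:.2f}: empirical/theoretical SD = {ratio:.3f}")
```

Every ratio is below 1, and in these rows the ratio shrinks as the true correlation increases.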
CAMERA has excellent power to detect sets with small but consistent expression fold-changes
| Cor | Percent DE genes | log2FC | df = 6 (Mod-t) | df = 6 (Ranks) | df = 27 (Mod-t) | df = 27 (Ranks) |
|---|---|---|---|---|---|---|
| 0 | 100 | 0.05 | 0.587 | 0.588 | 0.70 | 0.68 |
| 0 | 25 | 0.20 | 0.562 | 0.515 | 0.69 | 0.58 |
| 0.05 | 100 | 0.10 | 0.452 | 0.452 | 0.53 | 0.54 |
| 0.05 | 25 | 0.25 | 0.645 | 0.533 | 0.77 | 0.66 |
Columns 4–7 give probabilities of rejecting the null hypothesis at P < 0.05. Set size is m = 100 with either 100% or 25% of genes in the set differentially expressed between two groups of four arrays. Residual df is either 6 or 27 depending on whether or not the experiment includes a third group of 22 arrays. Inter-gene correlation is either 0 or 0.05. ‘Mod-t’ and ‘Ranks’ refer to parametric and rank-based CAMERA procedures, respectively. Results based on 1000 simulated data sets for each scenario.
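The two residual df values can be checked with a short calculation. The sketch below assumes a one-way layout in which one mean is fitted per group, so the residual df is the number of arrays minus the number of groups; that assumption is not stated in this excerpt but reproduces both values. The helper `residual_df` is hypothetical.

```python
def residual_df(group_sizes: list[int]) -> int:
    """Residual df for a one-way layout: total arrays minus one fitted mean per group (assumed design)."""
    return sum(group_sizes) - len(group_sizes)


df_two_groups = residual_df([4, 4])        # two groups of four arrays
df_three_groups = residual_df([4, 4, 22])  # plus a third group of 22 arrays

print(df_two_groups, df_three_groups)
```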
Figure 2. Inter-gene correlations for MSigDB gene sets in three microarray data sets. Top-left panel shows correlations. The other three panels show variance inflation factors (VIFs) for breast cancers, human mammary epithelial cells and mouse hematopoietic stem cells, respectively. The VIF plots show the cumulative distribution of VIFs over all gene sets. Solid and dotted horizontal lines show the mean and upper 95% quantile under the assumption of zero correlation.
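The VIF is not defined in this excerpt. For m genes with average pairwise correlation ρ, the variance of the mean of their statistics is inflated by 1 + (m − 1)ρ relative to independence. The sketch below uses that expression and then applies a rough normal-theory approximation (an assumption made here for illustration, not the simulation behind Table 1) to show how a test that ignores the inflation becomes liberal. The function names are illustrative.

```python
import math
from statistics import NormalDist


def vif(set_size: int, mean_correlation: float) -> float:
    """Variance inflation factor for the mean of set_size equicorrelated statistics."""
    return 1.0 + (set_size - 1) * mean_correlation


def naive_type1_rate(alpha: float, inflation: float) -> float:
    """Two-sided rejection rate of a z-test that assumes independence when the
    true variance is larger by the given inflation factor (normal approximation)."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - normal.cdf(z_crit / math.sqrt(inflation)))


# Setting of Table 1: 100 genes in the set with inter-gene correlation 0.05.
inflation = vif(100, 0.05)
rate_at_05 = naive_type1_rate(0.05, inflation)
rate_at_01 = naive_type1_rate(0.01, inflation)

print(f"VIF = {inflation:.2f}")
print(f"approximate naive error rate at nominal 0.05: {rate_at_05:.3f}")
print(f"approximate naive error rate at nominal 0.01: {rate_at_01:.3f}")
```

Even a correlation of 0.05 inflates the variance almost sixfold for a set of 100 genes, which is enough for the approximation to land in the same range as the liberal error rates reported for the existing methods in Table 1.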
Molecular signatures distinguishing basal-like from other breast cancer subtypes
| Gene set | N Genes | Correlation | Direction | P-value | FDR |
|---|---|---|---|---|---|
| Smid_Breast_Cancer_Basal_Up | 580 | 0.039 | up | 1.2e-09 | 1.9e-08 |
| Doane_Breast_Cancer_Esr1_Up | 98 | 0.063 | down | 1.4e-09 | 1.9e-08 |
| Smid_Breast_Cancer_Basal_Dn | 569 | 0.035 | down | 1.5e-09 | 1.9e-08 |
| Vantveer_Breast_Cancer_Esr1_Up | 116 | 0.062 | down | 3.2e-09 | 3.2e-08 |
| Smid_Breast_Cancer_Relapse_In_Bone_Up | 85 | 0.044 | down | 1.9e-08 | 1.3e-07 |
| Benporath_Es_Core_Nine_Correlated | 95 | 0.057 | up | 1.9e-08 | 1.3e-07 |
| Smid_Breast_Cancer_Relapse_In_Brain_Up | 38 | 0.065 | up | 2.4e-08 | 1.3e-07 |
| Yang_Breast_Cancer_Esr1_Up | 24 | 0.122 | down | 3.0e-08 | 1.5e-07 |
| Smid_Breast_Cancer_Relapse_In_Brain_Dn | 68 | 0.059 | down | 3.7e-08 | 1.6e-07 |
| Doane_Breast_Cancer_Esr1_Dn | 46 | 0.099 | up | 8.9e-08 | 3.6e-07 |
| Yang_Breast_Cancer_Esr1_Bulk_Up | 15 | 0.109 | down | 3.0e-07 | 1.1e-06 |
| Smid_Breast_Cancer_Relapse_In_Bone_Dn | 281 | 0.044 | up | 3.8e-07 | 1.3e-06 |
| Vantveer_Breast_Cancer_Esr1_Dn | 195 | 0.069 | up | 4.1e-07 | 1.3e-06 |
| Vantveer_Breast_Cancer_Metastasis_Up | 37 | 0.051 | down | 8.0e-07 | 2.3e-06 |
| Benporath_Es_Core_Nine | 9 | 0.097 | up | 9.0e-07 | 2.4e-06 |
| Smid_Breast_Cancer_Luminal_B_Up | 144 | 0.049 | down | 3.2e-06 | 8.0e-06 |
| Doane_Breast_Cancer_Classes_Up | 58 | 0.095 | down | 4.7e-06 | 1.1e-05 |
| Smid_Breast_Cancer_Luminal_A_Dn | 16 | 0.174 | up | 4.9e-06 | 1.1e-05 |
| Yang_Breast_Cancer_Esr1_Bulk_Dn | 15 | 0.066 | up | 6.1e-06 | 1.3e-05 |
| Yang_Breast_Cancer_Esr1_Laser_Up | 24 | 0.053 | down | 8.1e-06 | 1.6e-05 |
| Yang_Breast_Cancer_Esr1_Dn | 19 | 0.175 | up | 1.1e-05 | 2.2e-05 |
| Benporath_Es_1 | 319 | 0.024 | up | 1.7e-05 | 3.2e-05 |
| Vecchi_Gastric_Cancer_Early_Up | 342 | 0.053 | up | 2.1e-05 | 3.6e-05 |
| Smid_Breast_Cancer_Relapse_In_Lung_Up | 21 | 0.051 | up | 3.2e-05 | 5.4e-05 |
| Sotiriou_Breast_Cancer_Grade_1_Vs_3_Dn | 40 | 0.064 | down | 3.4e-05 | 5.5e-05 |
| Landemaine_Lung_Metastasis | 15 | 0.139 | up | 4.5e-05 | 6.9e-05 |
| Lien_Breast_Carcinoma_Metaplastic_Vs_Ductal_Dn | 90 | 0.097 | down | 4.8e-05 | 7.1e-05 |
| Charafe_Breast_Cancer_Luminal_Vs_Basal_Up | 276 | 0.034 | down | 5.2e-05 | 7.4e-05 |
| Vantveer_Breast_Cancer_Metastasis_Dn | 92 | 0.119 | up | 7.1e-05 | 9.8e-05 |
| Pujana_Breast_Cancer_With_Brca1_Mutated_Up | 50 | 0.144 | up | 1.3e-04 | 1.7e-04 |
| Chiang_Liver_Cancer_Subclass_Proliferation_Up | 132 | 0.071 | up | 1.5e-04 | 1.9e-04 |
| Vantveer_Breast_Cancer_Brca1_Up | 27 | 0.039 | up | 2.1e-04 | 2.6e-04 |
| Naderi_Breast_Cancer_Prognosis_Up | 37 | 0.123 | up | 2.8e-04 | 3.2e-04 |
| Doane_Breast_Cancer_Classes_Dn | 31 | 0.072 | up | 2.8e-04 | 3.2e-04 |
| Smid_Breast_Cancer_Luminal_A_Up | 74 | 0.122 | down | 2.9e-04 | 3.2e-04 |
| Luminal progenitor up | 297 | 0.032 | up | 0.00012 | |
| Luminal progenitor down | 157 | 0.040 | down | 0.00049 | |
CAMERA results for the top 35 gene sets from the MSigDB when comparing basal-like cancers to the average of the other five subtypes. Output includes the size of each set, the estimated inter-gene correlation, the direction of change, the two-sided P-value and the FDR. Also given are results for mammary luminal progenitor cell signatures.