Phillip D Yates, Mark A Reimers.
Abstract
BACKGROUND: Gene sets are widely used to interpret genome-scale data. Analysis techniques that make better use of the correlation structure of microarray data while addressing practical "n<p" concerns could provide a real increase in power. However, correlation structure is hard to estimate with typical genomics sample sizes. In this paper we present an extension of a classical multivariate procedure that confronts this challenge by the use of a regularized covariance matrix.
Year: 2009 PMID: 19772589 PMCID: PMC3087342 DOI: 10.1186/1471-2105-10-300
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
19] is used for assessing significance. A total of 10,000 permutations of the observation phenotype labels were used to determine the significance of the regularized covariance multivariate test. At each permutation step, a new α value was selected for the shuffled data. Although the hypothesis is two-sided, the use of a quadratic form for our test statistic requires a one-sided rejection region. If one elects to test more than one pathway, i.e., a set of gene sets, then one can apply a multiple comparison procedure to control the overall false discovery rate.
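The permutation scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shrinkage form (pooled covariance blended with its own diagonal), the placeholder rule in `choose_alpha`, the add-one p-value correction, and all function names are hypothetical, since this excerpt does not specify how α is selected.

```python
import numpy as np

def choose_alpha(x, y):
    """Hypothetical placeholder for the data-driven choice of alpha.

    Assumption: weight the sample covariance less when genes outnumber samples.
    This rule depends only on the dimensions, so it returns the same value for
    every shuffle; a real selection rule would depend on the data.
    """
    n, p = x.shape[0] + y.shape[0], x.shape[1]
    return n / (n + p)

def regularized_stat(x, y, alpha):
    """Quadratic-form statistic d' S_alpha^{-1} d with a shrunken covariance."""
    n1, n2 = x.shape[0], y.shape[0]
    d = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    # Assumed regularization: blend the pooled covariance with its diagonal.
    s_alpha = alpha * pooled + (1.0 - alpha) * np.diag(np.diag(pooled))
    return float(d @ np.linalg.solve(s_alpha, d))

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """One-sided permutation p-value: large statistics count against the null."""
    rng = np.random.default_rng(seed)
    data = np.vstack([x, y])
    n1 = x.shape[0]
    observed = regularized_stat(x, y, choose_alpha(x, y))
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(data.shape[0])  # shuffle phenotype labels
        xp, yp = data[idx[:n1]], data[idx[n1:]]
        # alpha is selected again for each shuffled data set
        stat = regularized_stat(xp, yp, choose_alpha(xp, yp))
        if stat >= observed:
            exceed += 1
    # Add-one correction (an assumption) keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Here `x` and `y` are samples-by-genes arrays for the two phenotype groups. Because the statistic is a quadratic form, only its upper tail is used, which matches the one-sided rejection region noted above.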
7] via simulation. Apart from RCMAT's use of a regularized covariance estimate, the method of Kong et al. is a close Hotelling's T² parallel to RCMAT. The Kong et al. method was independently presented as PCOT2 in [8]. We first examine the distribution of p-values when no difference exists between the groups in the average expression measures of genes in the pathway, i.e., the null hypothesis case. In a two-group null comparison, each p-value between 0 and 1 is equally likely. Figure 1 depicts the distribution of p-values for both RCMAT and the procedure of Kong et al. when no difference is present between the two groups. Each plot shows the 100 ranked p-values for each of the six settings in a uniform QQ-plot. The number of genes in a pathway was either 10 or 30, and the within-group sample size was 10, 20, or 50 for both groups. Both methods exhibit a somewhat conservative bias relative to the expected p-value.
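For readers who want to reproduce this kind of check, the coordinates of a uniform QQ-plot can be computed as in the sketch below. The plotting position i/(n+1), which is the expected value of the i-th smallest of n uniform draws, and the function name are assumptions; the paper's exact plotting convention is not given in this excerpt.

```python
def uniform_qq_points(p_values):
    """Pair each ranked p-value with its expected value under Uniform(0, 1)."""
    n = len(p_values)
    observed = sorted(p_values)
    # Expected value of the i-th order statistic of n uniform draws is i/(n+1).
    expected = [i / (n + 1) for i in range(1, n + 1)]
    return list(zip(expected, observed))
```

Points where the observed p-value exceeds its expected value indicate conservative behavior at that rank.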
Figure 1. Null distribution p-values. Null distribution p-values for both RCMAT and the method of Kong et al. under the null hypothesis (no average difference). P-values for 100 simulated data sets within each of six conditions are given. Along the vertical axis is the expected p-value under the assumption of no difference between the two phenotypes; on the horizontal axis is the corresponding actual p-value obtained from the simulation.
Figure 2. Cumulative distribution function of RCMAT nominal p-values under several simulated non-null conditions. Under each of 12 selected conditions, 100 simulation experiments were performed and permutation p-values obtained. The vertical axis is the cumulative distribution function, the proportion of values less than the observed p-value, for the 100 simulated data sets within a condition. A vertical line corresponding to a 0.05 nominal p-value is also provided.
Figure 3. Cumulative distribution function of a ratio of RCMAT and Kong et al. p-values. For 12 non-null conditions, 100 simulation experiments were performed and permutation p-values obtained for both RCMAT and the method of Kong et al. The base-ten logarithm of the RCMAT p-value/Kong et al. p-value ratio is shown on the horizontal axis. The vertical axis is the cumulative distribution function, the proportion of values less than the observed ratio, for the 100 simulated data sets within a condition.
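The quantity plotted in Figure 3 can be computed along the following lines. This is a small sketch with hypothetical function names; it uses the conventional "less than or equal to" form of the empirical cumulative distribution function, which is an assumption about the plotting convention.

```python
import math

def log10_ratios(p_rcmat, p_kong):
    """Base-ten log of the RCMAT / Kong et al. p-value ratio, per data set."""
    return [math.log10(a / b) for a, b in zip(p_rcmat, p_kong)]

def ecdf(values, x):
    """Proportion of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)
```

A negative log ratio means RCMAT gave the smaller p-value for that simulated data set, so a curve that rises mostly to the left of zero favors RCMAT.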
Comparison of RCMAT with the procedure of Kong et al.

| Gene set | Number of genes | RCMAT p-value | Kong et al. p-value |
|---|---|---|---|
| c25_U133_probes | 64 | 0.0003 | 0.0039 |
| MAP00600_Sphingoglycolipid_metabolism | 18 | 0.0018 | 0.0036 |
| MAP00300_Lysine_biosynthesis | 5 | 0.002 | 0.0089 |
| MAP00561_Glycerolipid_metabolism | 84 | 0.0028 | 0.7803 |
| c29_U133_probes | 202 | 0.0033 | 0.0672 |
| c33_U133_probes | 362 | 0.0034 | 0.1201 |
| c23_U133_probes | 109 | 0.0035 | 0.1581 |
| MAP00360_Phenylalanine_metabolism | 23 | 0.0036 | 0.0723 |
| MAP00531_Glycosaminoglycan_degradation | 18 | 0.0043 | 0.0005 |
| MAP00511_N_Glycan_degradation | 9 | 0.0072 | 0.0066 |
| GLUCO_HG-U133A_probes | 46 | 0.0084 | 0.4585 |
| GLYCOL_HG-U133A_probes | 31 | 0.0088 | 0.5699 |
| MAP00910_Nitrogen_metabolism | 31 | 0.0094 | 0.0385 |
| MAP00430_Taurine_and_hypotaurine_metabolism | 12 | 0.01 | 0.0888 |
| mitochondr_HG-U133A_probes | 615 | 0.0107 | 0.05 |
| MAP00650_Butanoate_metabolism | 38 | 0.0109 | 0.3458 |
| human_mitoDB_6_2002_HG-U133A_probes | 594 | 0.0113 | 0.0381 |
| c28_U133_probes | 288 | 0.0123 | 0.1947 |
| MAP00252_Alanine_and_aspartate_metabolism | 35 | 0.0131 | 0.0472 |
| c20_U133_probes | 270 | 0.0139 | 0.1125 |
| MAP00190_Oxidative_phosphorylation | 75 | 0.0141 | 0.2173 |
| c22_U133_probes | 194 | 0.0152 | 0.016 |
| MAP00710_Carbon_fixation | 27 | 0.0152 | 0.0297 |
| MAP00340_Histidine_metabolism | 32 | 0.0154 | 0.2045 |
| MAP00330_Arginine_and_proline_metabolism | 63 | 0.0168 | 0.0062 |
| c31_U133_probes | 346 | 0.0172 | 0.3197 |
| MAP00380_Tryptophan_metabolism | 88 | 0.018 | 0.7238 |
| MAP00380_Tryptophan_metabolism~ | 88 | 0.0195 | 0.7156 |
| c7_U133_probes | 349 | 0.0207 | 0.1292 |
| c15_U133_probes | 264 | 0.0232 | 0.4323 |
| MAP00512_O_Glycans_biosynthesis | 15 | 0.0236 | 0.0322 |
| MAP00970_Aminoacyl_tRNA_biosynthesis | 34 | 0.024 | 0.1054 |
| c27_U133_probes | 266 | 0.0253 | 0.1722 |
| MAP00251_Glutamate_metabolism | 35 | 0.0256 | 0.0259 |
| c12_U133_probes | 251 | 0.0263 | 0.087 |
| MAP00031_Inositol_metabolism | 7 | 0.0265 | 0.0677 |
| MAP00410_beta_Alanine_metabolism | 27 | 0.0291 | 0.5854 |
| c34_U133_probes | 452 | 0.0311 | 0.2366 |
| c11_U133_probes | 192 | 0.0334 | 0.4341 |
| c18_U133_probes | 248 | 0.0335 | 0.0167 |
| MAP00590_Prostaglandin_and_leukotriene_metabolism | 34 | 0.0348 | 0.1956 |
| c14_U133_probes | 302 | 0.0361 | 0.1327 |
| c35_U133_probes | 470 | 0.0419 | 0.2794 |
| OXPHOS_HG-U133A_probes | 114 | 0.0441 | 0.1705 |
| ROS_HG-U133A_probes | 9 | 0.0446 | 0.1523 |
| c3_U133_probes | 267 | 0.0455 | 0.5362 |
| c30_U133_probes | 239 | 0.0462 | 0.1006 |
| GO_0005739_HG-U133A_probes | 227 | 0.0467 | 0.3106 |
| MAP00310_Lysine_degradation | 35 | 0.0477 | 0.4446 |
| FA_HG-U133A_probes | 34 | 0.0485 | 0.1047 |
For each of the gene sets from Mootha et al. [3], both RCMAT and the method of Kong et al. were applied. Nominal (unadjusted) permutation p-values for each of the two procedures are given. The number of genes in the pathway is also provided.
24] use the single largest metagene, obtained with a singular value decomposition of expression values of genes in the group, to compare two phenotype groups. In a related extension, Kong et al. [7] use a singular value decomposition to locate a reduced gene subspace defined by the eigenvectors whose corresponding eigenvalues exceed a small positive number. However, the directions of the subspace corresponding to smaller eigenvalues of Σ can be poorly estimated. We conjecture that RCMAT is more powerful than the procedure of Kong et al. because RCMAT does not restrict the magnitude of the phenotypic transcription differences included and because it reduces the noise in the estimate of the covariance matrix, which is inverted. A degree of caution is still advised: highly unstable or p >> n gene set covariance estimators may be heavily biased by RCMAT due to the need for a large amount of regularization.
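To make the metagene idea mentioned above concrete, the sketch below assumes expression values arranged as a samples-by-genes matrix. The leading right singular vector of the column-centered matrix gives the gene loadings of the single largest metagene, and each sample's score on it could then be compared between the two phenotype groups. The function name and the centering step are illustrative assumptions, not the cited authors' exact procedure.

```python
import numpy as np

def first_metagene_scores(expr):
    """Return per-sample scores and gene loadings of the leading metagene.

    expr: array of shape (samples, genes).
    """
    centered = expr - expr.mean(axis=0)  # center each gene across samples
    # Thin SVD; rows of vt are orthonormal gene-space directions.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[0]                     # direction with the largest singular value
    scores = centered @ loadings         # one summary value per sample
    return scores, loadings
```

The scores could then be passed to any two-sample comparison; which test the cited work uses is not stated in this excerpt.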