
Gene-Set Reduction for Analysis of Major and Minor Gleason Scores Based on Differential Gene-Set Expressions and Biological Pathways in Prostate Cancer.

Irina Dinu¹, Surya Poudel¹, Saumyadipta Pyne².

Abstract

The Gleason score (GS) plays an important role in prostate cancer detection and treatment. It is calculated as the sum of its major and minor components, each ranging from 1 to 5, assigned after examination of sample cells taken from each side of the prostate gland during biopsy. A total GS of at least 7 is associated with more aggressive prostate cancer. However, it is still unclear how prostate cancer outcomes differ for various distributions of GS between its major and minor components. This article applies Significance Analysis of Microarray for Gene-Set Reduction to a real microarray study of patients with prostate cancer and identifies 13 core genes differentially expressed between patients with a major GS of 3 and a minor GS of 4, or (3,4), vs patients with a combination of (4,3), starting from a less aggressive GS combination of (3,3), and moving toward a more aggressive one of (4,4) via gray areas of (3,4) and (4,3). The resulting core genes may improve understanding of prostate cancer in patients with a total GS of 7, the most common grade and the most challenging with respect to prognosis.


Keywords: DNA microarrays; Gleason scores; core subsets; gene set analysis; gene set reduction

Year: 2017 PMID: 28932104 PMCID: PMC5598806 DOI: 10.1177/1176935117730016

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Prostate cancer is a heterogeneous disease in which some prostate cells lose normal control of growth and division and no longer function as healthy cells. Gleason grading plays an important role in the detection and treatment of prostate cancer.[1] In Gleason grading, sample cells are taken from each side of the prostate gland during biopsy and then examined under a microscope by a pathologist to determine whether cancer cells are present and to evaluate the microscopic features of any cancer found. A Gleason grade of 1 to 5, with higher grades indicating decreasing differentiation, is given to the prostate cancer based on the microscopic appearance of cancer cells in the prostate gland. The pathologist examines the biopsy specimen and assigns a grade to each of the 2 patterns. The primary grade represents most of the tumor; it has to account for more than 50% of the total pattern seen. This is also called the major component of the Gleason score (GS). The secondary grade relates to the minority of the tumor; it has to account for less than 50%, but at least 5%, of the total pattern seen.[2] This is also called the minor component of the GS. The GS is calculated as the sum of the major (primary) and minor (secondary) components and therefore ranges from 2 to 10. Tumors with higher GSs are more aggressive and have a worse prognosis. It has long been recognized that patients with a total GS ≥7 are at greater risk for adverse prostate cancer outcomes.[3] Although this finding has influenced clinical practice, it is still unclear how prostate cancer outcomes differ for various distributions of the total GS between its major and minor components. 
For example, among patients with a GS of 7, outcomes differ between those with a major GS of 3 and a minor GS of 4 and those with a major GS of 4 and a minor GS of 3, with the former group exhibiting better outcomes.[4] Our goal is to identify genes and biological pathways whose expression differs between patients with a major GS of 3 and minor GS of 4, or (3,4), and those with a major GS of 4 and minor GS of 3, or (4,3), starting from a less aggressive combination (3,3) and moving toward a more aggressive combination (4,4). Our strategy for analyzing microarray gene expression data is to focus on biological pathways, ie, sets of genes sharing a biological function. Results of gene-set analysis are easier to interpret than those of gene-level analysis and are more robust across similar studies. Gene-set enrichment analysis was the first method proposed for analysis of sets of genes differentially expressed between 2 conditions. An extensive review and methodological discussion are given by Nam and Kim.[5] The methods fall into 2 categories: competitive methods, which test the strength of the association of a gene set with the phenotype against other sets of the same size, and self-contained methods, which test the association of one set with the phenotype. Methods in both categories rely on a randomization testing approach to calculate significance and to address the small sample size, large gene-set problem. Competitive methods use permutations based on gene sampling, whereas self-contained methods use permutations based on subject sampling. We prefer the latter because it preserves correlations across genes in a set. In this article, we use Significance Analysis of Microarray for Gene Sets (SAM-GS),[6] a method previously found to perform very well compared with 6 other self-contained methods. 
The performance was assessed in simulation studies of type I error and power, as well as in applications to real data.[6,7] Another reason for using SAM-GS over other self-contained methods is its readily available extension. Significance Analysis of Microarray for Gene-Set Reduction (SAM-GSR)[8] is a method for extracting core subsets, ie, the genes that chiefly contribute to the significance of a set. The reasoning behind extracting core subsets is that not all the genes in a set contribute to its significance. Significance Analysis of Microarray for Gene-Set Reduction identifies core subsets by gradually retaining top-ranked genes and evaluating the significance of the remaining subset. The ability of the method to identify core subsets was tested in simulation studies for a binary phenotype, as well as in an application to real microarray data.[8] The rest of the article is organized as follows. In section “Methods,” we describe the data from the Swedish Watchful Waiting Cohort, the gene sets and pathways catalog, as well as the 2 methods, SAM-GS and SAM-GSR. We also present our strategy of moving gradually from a less aggressive GS combination to a more aggressive one, to distinguish between patients with a major GS of 3 and minor GS of 4 and patients with a major GS of 4 and a minor GS of 3. In sections “Results” and “Discussion,” we present the results and discuss their implications.

Methods

Individual gene analysis

Individual gene analysis is an approach to gene expression analysis that focuses on identifying individual genes whose expression differs between 2 states of interest. In response to the challenging characteristics of microarray data, Significance Analysis of Microarray (SAM)[9] was proposed as an individual gene analysis method. Significance Analysis of Microarray uses a moderated t test statistic, together with a false discovery rate (FDR) type of adjustment, calculated based on group-label (eg, case-control label) permutation tests. The high dimensionality problem calls for permutation tests, which are the basis for calculating the statistical significance of associations between a gene and the condition (eg, disease) of interest. Once a test statistic is calculated for the original data, its significance is evaluated by calculating the test statistic for permuted versions of the data set. Under the null hypothesis of no association, the group labels are interchangeable. The P value is calculated from the permutation distribution of the test statistic, as the proportion of permutations for which the permuted test statistic is as extreme as or more extreme than the observed test statistic. Significance Analysis of Microarray is based on analyses of random fluctuations in the data and computes gene-specific t-like tests. Although SAM is used for a wide variety of phenotypes, we focus on the binary phenotype here. The statistic measuring the relative difference in gene expression for gene i is given as follows:

$$d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0},$$

where $\bar{x}_1(i)$ is the average level of expression for gene i in the case group and $\bar{x}_2(i)$ is the average expression level for gene i in the control group. The pooled standard deviation, or “gene-specific scatter,” s(i) is as follows:

$$s(i) = \sqrt{a\left\{\sum_{m \in \text{cases}} \left[x_m(i) - \bar{x}_1(i)\right]^2 + \sum_{m \in \text{controls}} \left[x_m(i) - \bar{x}_2(i)\right]^2\right\}},$$

where $x_m(i)$ is the expression of gene i in sample m; $a = (1/n_1 + 1/n_2)/(n_1 + n_2 - 2)$; $n_1$ and $n_2$ are the numbers of cases and controls, respectively; and the small positive constant $s_0$ is added to adjust for the “small variability problem” in microarray measurements. Without this adjustment, values of d(i) could become very high at lower expression levels because of very small values of s(i). Adding the small positive constant $s_0$ to the denominator makes the variance of d(i) independent of the mean level of gene expression.
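As an illustration only, the relative difference d(i) for a single gene can be computed as in the minimal sketch below. The function name `sam_d`, the toy expression values, and the value of `s0` are assumptions made for this example; the choice of the constant s0 used in practice is not reproduced here.

```python
import math

def sam_d(case, ctrl, s0=0.0):
    """SAM relative difference d(i) for one gene.

    case, ctrl: expression values of the gene in the two groups.
    s0: small positive constant added to the denominator (assumed value).
    """
    n1, n2 = len(case), len(ctrl)
    m1 = sum(case) / n1
    m2 = sum(ctrl) / n2
    # Pooled "gene-specific scatter" s(i).
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    ss = sum((x - m1) ** 2 for x in case) + sum((x - m2) ** 2 for x in ctrl)
    s = math.sqrt(a * ss)
    return (m1 - m2) / (s + s0)

# Hypothetical expression values for one gene.
d_plain = sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])              # s0 = 0: ordinary pooled t statistic
d_moderated = sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0)  # shrunk toward 0 by s0
```

With s0 = 0 the statistic reduces to the usual pooled-variance t statistic; a positive s0 shrinks it, most strongly for genes with very small s(i).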

Gene-set analysis

Analyzing microarray data at the individual gene level usually leads to a list of many “significant” genes, even after multiple comparison adjustments have been made. Interpreting such a large list of genes is difficult. Moreover, replication of the findings in different microarray experiments is another serious challenge for individual gene-level analysis. Significance Analysis of Microarray for Gene Sets[6] combines the SAM t-like statistics of individual genes into a measure of association of the gene set with the phenotype. For a gene set S, it is the squared L2 norm of the t-like statistics described above:

$$\text{SAMGS} = \sum_{i \in S} d(i)^2.$$

Statistical significance of S is obtained based on a phenotype label permutation test. The method can be summarized in a few steps:

(1) For each of the N genes, calculate the statistic d(i) as in SAM for an individual gene analysis:

$$d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0},$$

where the “gene-specific scatter” s(i) is a pooled standard deviation over the 2 groups of the phenotype, and $s_0$ is a small positive constant that adjusts for the small variability encountered in microarray data.

(2) Compute the SAM-GS test statistic corresponding to set S:

$$\text{SAMGS} = \sum_{i \in S} d(i)^2.$$

(3) Permute the labels of the phenotype and repeat steps (1) and (2). Repeat until all (or a large number of) permutations are considered.

Statistical significance for the association of S and the phenotype is obtained by comparing the observed value of the SAM-GS statistic from step (2) with its permutation distribution from step (3).
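The permutation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: the function names, the toy data, the value of `s0`, the number of permutations, and the use of random (rather than exhaustive) label permutations are all choices made for this example.

```python
import math
import random

def sam_d(case, ctrl, s0):
    """SAM t-like statistic d(i) for one gene."""
    n1, n2 = len(case), len(ctrl)
    m1, m2 = sum(case) / n1, sum(ctrl) / n2
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    ss = sum((x - m1) ** 2 for x in case) + sum((x - m2) ** 2 for x in ctrl)
    return (m1 - m2) / (math.sqrt(a * ss) + s0)

def sam_gs_stat(gene_set, labels, s0):
    """Sum of squared d(i) over the genes in the set (steps 1 and 2)."""
    total = 0.0
    for gene in gene_set:
        case = [x for x, g in zip(gene, labels) if g == 1]
        ctrl = [x for x, g in zip(gene, labels) if g == 0]
        total += sam_d(case, ctrl, s0) ** 2
    return total

def sam_gs_pvalue(gene_set, labels, s0=0.1, n_perm=1000, seed=0):
    """Permutation P value: share of label permutations whose statistic
    is at least as large as the observed one (step 3)."""
    rng = random.Random(seed)
    observed = sam_gs_stat(gene_set, labels, s0)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)  # permute subjects, keeping each subject's genes together
        if sam_gs_stat(gene_set, perm, s0) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical toy set: 2 genes measured on 6 cases (label 1) and 6 controls (label 0).
labels = [1] * 6 + [0] * 6
toy_set = [
    [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 1.0, 1.1, 0.9, 1.2, 0.8, 1.0],  # strong group difference
    [2.0, 2.3, 1.9, 2.1, 2.2, 1.8, 2.1, 1.9, 2.2, 2.0, 1.8, 2.3],  # no group difference
]
p_value = sam_gs_pvalue(toy_set, labels)
```

Because whole subjects are permuted, the correlation structure among genes in the set is preserved in every permuted data set, which is the property of self-contained methods noted in the Introduction.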

Gene-set reduction

Significance Analyses of Microarray for Gene-Set Reduction, proposed by Dinu et al,[8] was motivated by the fact that not all genes in a significant set contribute to its significance. Given a statistically significant association of the gene set S with the phenotype, SAM-GSR applies SAM-GS sequentially to subsets of S and identifies a core set of genes that contribute most to the statistical significance of S. In reducing the gene set S, we used the following principle: for a pair of genes i and j in S with |d(i)| > |d(j)|, gene j belongs to the core subset only if gene i belongs to it. This principle is motivated by the fact that |d(i)| represents each gene’s contribution to the SAM-GS test statistic, and the core subset must consist of genes with larger contributions. Significance Analyses of Microarray for Gene-Set Reduction gradually partitions the entire set S into 2 subsets, based on the principle above, and evaluates their association with the phenotype. Significance Analyses of Microarray for Gene-Set Reduction can be summarized in a few steps:

(1) For each of the N genes, calculate the statistic d(i) as in SAM for an individual gene analysis:

$$d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0},$$

where $\bar{x}_1(i)$ is the average level of expression for gene i in the case group, whereas $\bar{x}_2(i)$ is the average expression level for gene i in the control group; s(i) is the pooled standard deviation of gene expression over the 2 groups of the phenotype; and $s_0$ is the small positive constant that adjusts for the small variability encountered in microarray data.

(2) For $k = 1, \ldots, |S| - 1$, select the first k genes with the largest statistic |d(i)| to form a reduced set $R_k$. Let $c_k$ be the SAM-GS P value of the complement of $R_k$ in S.

(3) The reduced set corresponds to the least k such that $c_k$ is larger than a threshold c, chosen by the analyst.

By removing only genes whose joint statistical significance, as a set, is above the threshold, ie, $c_k > c$, we are protected against losing genes that are not significant by themselves but collectively form a set that is significant.[8]
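The reduction steps above can be sketched in a few lines. This is an illustrative reading of the procedure, not the authors' implementation: the function names, the toy data, the threshold `c = 0.05`, the value of `s0`, and the number of random permutations are all assumptions made for this example, and the input set is assumed to be already significant by SAM-GS.

```python
import math
import random

def gene_d(gene, labels, s0):
    """SAM t-like statistic d(i) for one gene under the given labels."""
    case = [x for x, g in zip(gene, labels) if g == 1]
    ctrl = [x for x, g in zip(gene, labels) if g == 0]
    n1, n2 = len(case), len(ctrl)
    m1, m2 = sum(case) / n1, sum(ctrl) / n2
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    ss = sum((x - m1) ** 2 for x in case) + sum((x - m2) ** 2 for x in ctrl)
    return (m1 - m2) / (math.sqrt(a * ss) + s0)

def set_pvalue(genes, labels, s0, n_perm, rng):
    """SAM-GS permutation P value for a list of genes."""
    observed = sum(gene_d(g, labels, s0) ** 2 for g in genes)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if sum(gene_d(g, perm, s0) ** 2 for g in genes) >= observed:
            hits += 1
    return hits / n_perm

def sam_gsr(genes, labels, c=0.05, s0=0.1, n_perm=200, seed=0):
    """Return indices of the core genes: the smallest top-k group whose
    complement in the set is no longer significant (c_k > c)."""
    rng = random.Random(seed)
    abs_d = [abs(gene_d(g, labels, s0)) for g in genes]
    order = sorted(range(len(genes)), key=lambda i: abs_d[i], reverse=True)
    for k in range(1, len(order)):
        complement = [genes[i] for i in order[k:]]
        c_k = set_pvalue(complement, labels, s0, n_perm, rng)
        if c_k > c:
            return order[:k]
    return order  # no proper subset satisfied the threshold

# Hypothetical toy set: gene 0 carries the signal, genes 1 and 2 do not.
labels = [1] * 6 + [0] * 6
toy_set = [
    [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 1.0, 1.1, 0.9, 1.2, 0.8, 1.0],
    [2.0, 2.3, 1.9, 2.1, 2.2, 1.8, 2.1, 1.9, 2.2, 2.0, 1.8, 2.3],
    [3.1, 2.9, 3.0, 3.2, 2.8, 3.0, 3.0, 3.2, 2.8, 3.1, 2.9, 3.0],
]
core = sam_gsr(toy_set, labels)
```

In this toy example the complement of the top-ranked gene shows no association with the labels, so the loop stops at k = 1 and the core set contains only gene 0.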

Results

Data description

We used data from the Swedish Watchful Waiting Cohort with up to 30 years of clinical follow-up.[10,11] The data are nested in a cohort of men with localized prostate cancer diagnosed in the Örebro (1977-1994) and South East (1987-1999) Health Care Regions of Sweden. Eligible patients were identified through population-based prostate cancer quality databases maintained in these regions, which were described in detail in the study by Johansson et al.[12] The study cohort was followed for cancer-specific and all-cause mortality until March 1, 2006, through record linkages to the Swedish Death Register, which provided date of death or migration. Information on causes of death was obtained through a complete review of medical records by a study end point committee. Deaths were classified as cancer specific when prostate cancer was the primary cause of death. Sboner et al were able to trace tumor tissue specimens from 92% of all potentially eligible cases. Messenger RNA expression of 6100 genes was measured on 255 patients, divided into 2 extreme groups: men who died of prostate cancer and men who survived more than 10 years of follow-up without metastases. These 2 groups are referred to as lethal and indolent patients with prostate cancer. Clinical, pathological, and demographical characteristics of the 255 patients are given in Table 1. Prostate-specific antigen is not available in this cohort, as there were no screening programs in place at the time.

Table 1.

Clinical, pathological, and demographical characteristics of the 255 patients.

Characteristics	Counts (%)	Extreme group: indolent	Extreme group: lethal	Fisher exact test P value	Odds ratio (95% CI)
Gleason score
<7	77 (30.2)	52	25
7	104 (40.8)	46	58
>7	74 (29.0)	8	66	1.14*10⁻¹²
Gleason combinations
(3,3)	77 (37.5)	52	25
(3,4)	71 (34.6)	36	35
(4,3)	33 (16)	10	23
(4,4)	24 (11.7)	7	17	2.2*10⁻¹⁶
Age
≤70	77 (30.2)	39	38
>70	178 (69.8)	67	111	.07	1.7 (0.95-3.02)
Tumor area in biopsy, %
≤5	82 (32.2)	54	28
>5-25	88 (34.5)	39	49
>25-50	45 (17.6)	10	35
>50	35 (13.7)	2	33	9.02*10⁻¹¹
Not assessable	5 (2)
ERG rearrangement status (fusion)
Negative (0)	206 (80.8)	96	110
Positive (1)	40 (15.7)	5	35	3.64*10⁻⁵	6.07 (2.24-20.65)
Not assessable	9 (3.5)
Extreme groups
Lethal	149 (58.4)
Indolent	106 (41.6)
Survival status
Alive	71 (27.8)
Dead	184 (72.2)


Biological pathways and gene sets from Molecular Signatures Database

An important aspect of microarray data analysis is accessing extensive collections of gene sets and properly linking them to gene expression data. Microarray studies typically result in long lists of genes, not always easy to interpret. Scientists put together lists of genes sharing a common biological function, ie, biological pathways. The analysis at the gene-set or pathway level improves on interpretation and reproducibility across studies. The Molecular Signatures Database (MSigDB)[13] available for download from http://www.broad.mit.edu/gsea is one of the most widely used repositories of knowledge expert–derived sets of genes and biological pathways. A growing number of databases store sets from gene expression signatures reported in the literature. Molecular Signatures Database differs from these resources in several aspects: (1) the catalog is formatted for gene-set analysis; (2) it covers a more diverse and wider range of gene-set resources and types, including original research publications and entire collections of sets derived from specialized resources; (3) MSigDB is built both through manual curation and by automatic computational means, whereas other databases emphasize only one of these approaches; and (4) the collection contains the largest number of gene sets overall. For our analyses, we used the MSigDB C2 catalog consisting of 1892 gene sets, representing metabolic and signaling pathways from online pathway databases, gene sets from biomedical literature including 786 scientific publications, gene sets compiled from published mammalian microarray studies, and gene sets defined by mining large collections of cancer-oriented microarray data.

Gene-set reduction results for GS ranging from (3,3) to (4,4)

Data analyses started by validating a strong signal in our data at the level of lethal vs nonlethal patients with prostate cancer. In total, 1351 of the 1892 MSigDB gene sets were found to be differentially expressed between 149 lethal and 106 nonlethal patients with prostate cancer, using SAM-GS. Furthermore, 1246 gene sets were found to be differentially expressed between 80 patients with major and minor GSs ≤3 vs 68 patients with major and minor GSs ≥4. Our goal was to compare biological pathways and gene sets across various combinations of major and minor GS components. There might be some overlapping sets differentiating across various combinations. However, we did not hypothesize that the same groups of genes would differentiate across all the combinations. Therefore, a union of unique core genes from the analyses of all the combinations is reported in Figure 1. The number of significant gene sets and the core set sizes decreased considerably when comparing patients with larger total GS, indicating a challenge in discriminating between higher risk groups of patients. For example, a comparison of 77 patients with GS of (3,3) vs 62 patients with GS of (3,4) gives 369 gene sets significant at a P value of .05/4 = .0125. The Bonferroni adjustment corresponds to a total of 4 GS combinations, as described in Figure 1. Eight gene sets are differentially expressed between GS of (3,4) vs (4,4), and only one gene set differentiates between (4,3) and (4,4). The FDR[14] is provided as a measure of adjustment for testing a large number of genes and is given by the expected proportion of false positives among all tests called significant. The FDR cutoffs for the 4 combinations are 0.006, 0.004, 0.27, and 0.95.

Figure 1.

Gene-set reduction flowchart. SAM-GSR indicates Significance Analyses of Microarray for Gene-Set Reduction.

Significance Analyses of Microarray for Gene-Set Reduction achieved a 91% reduction, averaged over the 4 GS combinations, starting from (3,3) and ending with (4,4). The 369 gene sets differentiating between (3,3) and (3,4) were reduced to 332 unique genes shared across the core gene sets. The percent reduction was calculated for each gene set as the number of genes outside the core set divided by the size of the gene set and multiplied by 100. The percent reduction is averaged over the significant gene sets. The overall average percent reduction across combinations ranging from (3,3) to (4,4) was 91%. Moving from a less aggressive GS combination (3,3) to a more aggressive combination (4,4), 580 unique genes were identified. At the gene set–level analysis, only 1 of the 8 pathways differentiating between (3,4) vs (4,4) is represented among the 369 pathways differentiating between (3,3) vs (3,4). Negative log P values according to the 2 analyses are shown in Figure 2. The 8 pathways are represented as letters of the alphabet from A to H. Similarly, only 1 of the 8 pathways differentiating between (3,4) vs (4,4) is represented among the 389 pathways differentiating between (3,3) vs (4,3) (Figure 3).
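To make the percent reduction calculation concrete, the short sketch below applies the definition to the gene-set sizes and core-set sizes reported in Table 4 for the (3,4) vs (4,4) comparison. It covers that single comparison only, so its average is not expected to reproduce the 91% figure, which is averaged over all 4 combinations. The variable and function names are assumptions made for this example.

```python
# (gene-set size, core-set size) pairs copied from Table 4, (3,4) vs (4,4).
table4 = {
    "BUT_TSA_UP": (18, 1),
    "CMV_HCMV_TIMECOURSE_14HRS_DN": (36, 2),
    "FERRANDO_CHEMO_RESPONSE_PATHWAY": (9, 1),
    "HDACI_COLON_CUR24HRS_UP": (27, 3),
    "LEE_CIP_UP": (50, 2),
    "TSA_PANC50_UP": (29, 2),
    "UEDA_MOUSE_SCN": (58, 2),
    "UREACYCLEPATHWAY": (7, 2),
}

def percent_reduction(set_size, core_size):
    """Genes outside the core set, as a percentage of the gene-set size."""
    return 100.0 * (set_size - core_size) / set_size

reductions = {name: percent_reduction(*sizes) for name, sizes in table4.items()}
average_reduction = sum(reductions.values()) / len(reductions)
```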

Figure 2.

Negative log P values for gene sets differentially expressed between (3,4) vs (4,4) or (3,3) vs (3,4).

The 8 gene sets differentiating between (3,4) vs (4,4) are denoted as letters of alphabet as shown below.

Figure 3.

Negative log P values for gene sets differentially expressed between (3,4) vs (4,4) or (3,3) vs (4,3).

The 8 gene sets differentiating between (3,4) vs (4,4) are denoted as letters of alphabet as shown below.

There were 179 gene sets overlapping across the analyses of (3,3) vs (3,4) and (3,3) vs (4,3). At the gene level, there were 84 overlapping genes across the core genes differentiating between (3,3) vs (3,4) and (3,3) vs (4,3). These results are presented as Supplementary Material.

Gene-set reduction results for GS of (3,4) vs (4,3)

We performed a gene-set analysis and reduction for 62 patients with GS of (3,4) vs 46 patients with GS of (4,3). In total, 32 gene sets were identified at the .05 significance level, with an FDR value of 0.75. The core sets of the 32 gene sets are presented in Table 2.

Table 2.

Results of SAM-GS and SAM-GSR analyses for 62 patients with GS of (3,4) vs 46 patients with GS of (4,3), together with P values from a global gene set analysis of GSs of 6, 7, and 8.

Gene-set name	Gene-set size	SAM-GS, P value	Global analysis of GSs of 6, 7, and 8	SAM-GSR core genes
AGED_MOUSE_HYPOTH_DN	28	.002	.002	DNM1 FSTL1 APOE
CD40PATHWAY[a]	9	.008	.365	IKBKAP
HSA05110_CHOLERA_INFECTION	23	.011	.027	SEC61A1
HEATSHOCK_YOUNG_UP	9	.016	<.001	ANXA1
NOUZOVA_CPG_METHLTD	22	.018	<.001	EFNA5 EPHA5
VEGF_HUVEC_2HRS_UP[a]	25	.018	.274	APOE PPY
HYPOPHYSECTOMY_RAT_DN	39	.021	<.001	COL3A1 NPPA
PENG_GLUCOSE_UP	32	.022	<.001	OCLN
LIAN_MYELOID_DIFF_TF	31	.022	.015	BHLHB2 MYB NFKB1
HSA00330_ARGININE_AND_PROLINE_METABOLISM[a]	25	.023	.212	ARG2
ADIPOGENESIS_HMSC_CLASS5_UP[a]	6	.025	.285	MYB
ONE_CARBON_POOL_BY_FOLATE	15	.028	.041	SHMT2
TNFR2PATHWAY	14	.029	.019	IKBKAP
UVC_HIGH_D9_DN	20	.03	.039	NAP1L1
HDACI_COLON_CLUSTER6[a]	24	.031	.317	NAP1L1
NDKDYNAMINPATHWAY[a]	15	.032	.112	DNM1
TYPE_III_SECRETION_SYSTEM	14	.034	.011	ATP6V1C1
ANDROGEN_GENES	43	.036	.013	NR1I3
GH_HYPOPHYSECTOMY_RAT_UP	10	.036	.042	COL3A1
ARGININE_AND_PROLINE_METABOLISM	42	.04	.001	MAOA
FMLPPATHWAY[a]	30	.04	.103	NFATC3
HSA00670_ONE_CARBON_POOL_BY_FOLATE	13	.04	.031	SHMT2
PHOTOSYNTHESIS	15	.041	.016	ATP6V1C1
HSA00051_FRUCTOSE_AND_MANNOSE_METABOLISM	28	.041	.005	MTMR6
KIM_TH_CELLS_UP[a]	31	.044	.124	ETS1
GCRPATHWAY	16	.044	.001	ANXA1
HEARTFAILURE_ATRIA_UP	20	.045	.051	FKBP8
ALZHEIMERS_INCIPIENT_DN	88	.046	<.001	UROS
GAMMA.UV_FIBRO_UP	25	.046	.005	IL10RB
AGUIRRE_PANCREAS_CHR8	28	.047	.002	HAS2
GH_GHRHR_KO_24HRS_DN	73	.047	.013	IFNAR1
FERRANDO_CHEMO_RESPONSE_PATHWAY[a]	9	.048	.329	DTYMK

Abbreviations: GS, Gleason score; SAM-GS, Significance Analysis of Microarray for Gene Sets; SAM-GSR, Significance Analyses of Microarray for Gene-Set Reduction.

[a] Gene sets not significant in the global analysis of GS of 6, 7, and 8, although significant in the analysis of (3,4) vs (4,3) combinations.

We compared the results of analysis of GS (3,4) vs (4,3) with results of analysis of GS ranging from (3,3) to (4,4). Significance Analysis of Microarray for Gene Sets P values of the 8 gene sets differentiating between (3,4) and (4,4) are presented in Table 3.

Table 3.

SAM-GS P values for various distributions of Gleason scores, together with P values from a global gene set analysis of Gleason scores of 6, 7, and 8.

Gene-set name	Gene-set size	(3,3) vs (3,4)	(3,3) vs (4,3)	(3,4) vs (4,4)	(4,3) vs (4,4)	(3,4) vs (4,3)	Global analysis of Gleason scores 6, 7, and 8
BUT_TSA_UP	18	.179	.254	.008	.174	.24	.047
CMV_HCMV_TIMECOURSE_14HRS_DN	36	.047	.046	.007	.069	.574	.005
FERRANDO_CHEMO_RESPONSE_PATHWAY	9	.042	.016	.01	.045	.048	.329
HDACI_COLON_CUR24HRS_UP	27	.005	.2	.01	.069	.383	.024
LEE_CIP_UP	50	.088	.076	.01	.066	.834	.002
TSA_PANC50_UP	29	.128	.048	.003	.029	.346	.001
UEDA_MOUSE_SCN	58	.05	.001	.005	.228	.15	.011
UREACYCLEPATHWAY	7	.721	.536	.001	.016	.07	.155

Abbreviation: SAM-GS, Significance Analysis of Microarray for Gene Sets.

At the gene set–level analysis, only 1 of the 8 pathways differentiating between (3,4) vs (4,4) is represented among the 32 pathways differentiating between (3,4) vs (4,3). Negative log P values according to the 2 analyses are shown in Figure 4. The 8 pathways are represented as letters of the alphabet from A to H.
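The overlap statement above can be checked directly against the P values in Table 3. The sketch below copies the (3,4) vs (4,4) and (3,4) vs (4,3) columns, converts them to negative log P values, and lists the gene sets that reach the .05 level in both comparisons. The base-10 logarithm is an assumption, because the base used for the figures is not stated, and the variable names are illustrative.

```python
import math

# P values copied from Table 3: ((3,4) vs (4,4), (3,4) vs (4,3)).
table3 = {
    "BUT_TSA_UP": (0.008, 0.24),
    "CMV_HCMV_TIMECOURSE_14HRS_DN": (0.007, 0.574),
    "FERRANDO_CHEMO_RESPONSE_PATHWAY": (0.01, 0.048),
    "HDACI_COLON_CUR24HRS_UP": (0.01, 0.383),
    "LEE_CIP_UP": (0.01, 0.834),
    "TSA_PANC50_UP": (0.003, 0.346),
    "UEDA_MOUSE_SCN": (0.005, 0.15),
    "UREACYCLEPATHWAY": (0.001, 0.07),
}
ALPHA = 0.05

# Negative log10 P values, one pair per gene set (base 10 is assumed).
neg_log_p = {
    name: tuple(-math.log10(p) for p in p_values)
    for name, p_values in table3.items()
}

# Gene sets significant at ALPHA in both comparisons.
shared = [
    name for name, (p_34_44, p_34_43) in table3.items()
    if p_34_44 <= ALPHA and p_34_43 <= ALPHA
]
```

Only FERRANDO_CHEMO_RESPONSE_PATHWAY meets the .05 level in both columns, in line with the single shared pathway reported in the text.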

Figure 4.

Negative log P values for gene sets differentially expressed between (3,4) vs (4,4) or (3,4) vs (4,3).

The 8 gene sets differentiating between (3,4) vs (4,4) are denoted as letters of alphabet as shown below.

At the gene-level analysis, none of the 13 core genes from comparing (3,4) vs (4,4) are represented among the 332 core genes comparing (3,3) vs (3,4) or among the 323 core genes comparing (3,3) vs (4,3). The 13 core genes are shown in Table 4. Boxplots of some of these core gene expressions are presented in Figure 5. Although the boxplots show small differences, we need to keep in mind that the concept of gene set analysis was developed to address small but coordinated changes in gene expressions, across the set. The correlations across a gene set or biological pathway drive the association with the phenotype, even if the changes at the individual gene level are small.[5,6] Biological process and cellular component from Gene Ontology for core genes are presented in Table 5. The set consisting of the 13 genes shows a marginal association with GS of (3,4) vs (4,3), with a SAM-GS P value of .059.

Table 4.

SAM-GS and SAM-GSR analyses for 62 patients with Gleason score of (3,4) vs 12 patients with Gleason score of (4,4).

Gene-set name	Gene-set size	P value	Core set size	Core genes
BUT_TSA_UP	18	.008	1	GADD45A
CMV_HCMV_TIMECOURSE_14HRS_DN	36	.007	2	ETV1 APEX1
FERRANDO_CHEMO_RESPONSE_PATHWAY	9	.01	1	CDA
HDACI_COLON_CUR24HRS_UP	27	.01	3	RPN2 ALDOA CCND1
LEE_CIP_UP	50	.01	2	ETV1 COL4A2
TSA_PANC50_UP	29	.003	2	BIK NOTCH3
UEDA_MOUSE_SCN	58	.005	2	GADD45A SMPDL3A
UREACYCLEPATHWAY	7	.001	2	CPS1 ASL

Abbreviations: SAM-GS, Significance Analysis of Microarray for Gene Sets; SAM-GSR, Significance Analyses of Microarray for Gene-Set Reduction.

Figure 5.

Boxplots of core gene expressions among the 13 genes differentiating between GS of (3,4) and GS of (4,4). GS indicates Gleason score.

Table 5.

Biological process and cellular component from Gene Ontology for core genes from SAM-GSR analyses for 62 patients with GS of (3,4) vs 12 patients with GS of (4,4).

Core gene name	Biological process	Cellular component
ETV1	Cell growth, angiogenesis, migration, proliferation, and differentiation	Nucleus
GADD45A	Cell cycle arrest	Nucleus, cytoplasm
ALDOA [a]	Fructose and glucose metabolic process	Nucleus, cytosol
APEX1	Mitotic cell cycle	Nucleus, cytoplasm
ASL	Urea cycle, cellular nitrogen compound, metabolic process	Cytoplasm, cytosol
BIK	Apoptotic	Endomembrane system
CCND1	Transition of mitotic cell cycle	Nucleus, cytosol
CDA	Pyrimidine nucleobase metabolic process, cell surface receptor signaling pathway	Extracellular region, cytosol
COL4A2 [a]	Angiogenesis, endodermal cell differentiation, cellular response to transforming growth factor β stimulus	Extracellular region
CPS1	Urea cycle, glutamine metabolic process	Nucleus, cytoplasm, mitochondrial inner membrane
NOTCH3	Notch signaling pathway, negative regulation of neuron differentiation	Nucleoplasm, cytoplasm, extracellular region
RPN2 [a]	Translation, cellular protein modification process, cellular protein metabolic process, response to drug, posttranslational protein modification	Autophagosome membrane, nucleus, integral component of membrane
SMPDL3A [a]	Sphingomyelin catabolic process	Extracellular space, extracellular exosome

Abbreviations: GS, Gleason score; SAM-GSR, Significance Analyses of Microarray for Gene-Set Reduction.

[a] Genes not identified as significant in SAM-GSR analysis of patients with GS of 6 vs GS of 7 or GS of 7 vs GS of 8.

We also performed a global analysis of GSs of 6, 7, and 8. The results of the global analysis are shown in Tables 2 and 3. The global analysis resulted in 66% of the gene sets with P values less than or equal to .05 and 25% less than or equal to .001. This supports previous knowledge that GSs of 7 and above are significantly different from a GS of 6. However, the breakdown by major and minor components is needed to sort out the groups of patients where the differences occur. Most of the gene sets show significant P values in the global analysis, in agreement with differences occurring at some of the major and minor combinations. However, we note that some of the gene sets show lack of significance in the global analysis, despite significance in some of the analyses of the major and minor combinations. In Table 2, 9 out of 32 gene sets that are significantly different between (3,4) and (4,3) are not significant in the global analysis of multiple GSs. In Table 3, 2 out of 8 gene sets did not reach the significance level in the global analysis, although they appear different in some of the analyses of the major and minor combinations. These differences may be caused by the fact that the overall scores are collapsed over major and minor combinations and, after further validation, may provide some insights into the 3 + 4 ≠ 4 + 3 prostate cancer hypothesis.

Discussion

The Gleason score plays an important role in prostate cancer diagnosis and treatment. Current practice considers patients with a total GS of 7 or larger to be at higher risk. It has been recognized in the literature that the breakdown of the total GS into its major and minor components plays an important role in understanding the severity of the disease, with patients exhibiting a GS combination of (4,3) being at higher risk than those with a GS combination of (3,4). We studied differences at the gene and gene-set levels between patients with various combinations of major and minor GSs, moving from a less aggressive combination of (3,3) toward a more aggressive combination of (4,4). We note that groups of patients within this GS range are expected to exhibit subtle changes, especially at the gene level. Significance Analysis of Microarray for Gene Sets is a powerful method for detecting subtle and coordinated changes in microarray gene expression data. Gene-set analysis was developed in response to moderate to weak signal at the gene level. The key element in gene-set analysis is to take advantage of correlations across genes in a set, thereby boosting the power of the analysis. Significance Analysis of Microarray for Gene Sets was found to perform well in comparative studies of 7 self-contained gene-set analysis methods.[8] One of the weaknesses of self-contained methods is that only a few genes in a set can drive the significance of the whole set. Significance Analysis of Microarray for Gene-Set Reduction was designed to extract core genes that contribute to the significance of the whole set. We reason that these 2 methods are appropriate for analyzing differences at gene and gene-set levels across various combinations of GSs. Some of the gene sets and pathways identified as significant in our analyses have been previously found to play various roles in cancer progression and in the identification of novel therapeutic strategies. 
For example, the CD40 pathway differentially expressed between GS of (3,4) vs (4,3) has been shown to play an immunosuppressive role.[15] The CD40 pathway has also been shown to play a crucial role in production of cytokines, which modulate the function of T lymphocytes in antitumor responses.[16] TNFR2 pathway was also differentially expressed between GS of (3,4) vs (4,3). TNFR2 is a receptor of tumor necrosis factor, a multifunctional pro-inflammatory cytokine. Members of the tumor necrosis factor receptor superfamily can send both survival and death signals to cells.[17] Urea cycle pathway was differentially expressed between GS of (3,4) vs (4,4), P value of .001, and GS of (4,3) vs (4,4), P value of .016; marginally significant for GS of (3,4) vs (4,3), P value of .07; and not significant for GS of (3,3) vs (3,4), P value of .721, or (3,3) vs (4,3), P value of .536. In urea cycle pathway, the enzyme ornithine decarboxylase converts the metabolite ornithine to putrescine. Ornithine decarboxylase has previously been found as overexpressed in prostate cancer[18] and is the target of the chemotherapeutic agent difluoromethylornithine.[19]
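To make the two methods discussed above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly described form of the gene-set statistic (the sum over the genes in a set of squared SAM-type statistics, with significance assessed by permuting group labels) and a reduction step that adds genes in decreasing order of absolute statistic until the remaining genes are no longer significant. The function names (`sam_d`, `set_stat`, `perm_pvalue`, `reduce_set`), the fixed constant `s0 = 0.1`, the cutoff `alpha`, and the toy data are all illustrative assumptions; in SAM itself the constant is estimated from the data.

```python
import random
from math import sqrt

def sam_d(x, y, s0=0.1):
    """SAM-type statistic for one gene: mean difference over (pooled SE + s0).
    s0 is a fixed illustrative constant here (assumption)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))  # pooled standard error
    return (m1 - m2) / (s + s0)

def split(gene, labels):
    """Split one gene's expression values by group label (0 or 1)."""
    x = [v for v, g in zip(gene, labels) if g == 0]
    y = [v for v, g in zip(gene, labels) if g == 1]
    return x, y

def set_stat(expr, labels, s0=0.1):
    """Gene-set statistic: sum of squared per-gene statistics."""
    return sum(sam_d(*split(gene, labels), s0) ** 2 for gene in expr)

def perm_pvalue(expr, labels, n_perm=500, s0=0.1, seed=0):
    """Permutation P value: share of label shuffles with a statistic at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = set_stat(expr, labels, s0)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_stat(expr, shuffled, s0) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def reduce_set(expr, labels, alpha=0.05, n_perm=500, s0=0.1, seed=0):
    """Return indices of a reduced 'core' subset: genes are added in
    decreasing order of |d| until the leftover genes are not significant."""
    d = [abs(sam_d(*split(gene, labels), s0)) for gene in expr]
    order = sorted(range(len(expr)), key=lambda i: -d[i])
    for k in range(1, len(order)):
        rest = [expr[i] for i in order[k:]]
        if perm_pvalue(rest, labels, n_perm, s0, seed) > alpha:
            return order[:k]
    return order

# Toy data (hypothetical): 3 genes, 6 samples per group; only gene 0 differs.
labels = [0] * 6 + [1] * 6
expr = [
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 3.0, 3.1, 2.9, 3.2, 3.0, 2.8],
    [1.0, 2.0, 1.5, 1.2, 1.8, 1.4, 1.3, 1.9, 1.1, 1.6, 1.7, 1.5],
    [2.0, 2.5, 2.2, 2.8, 2.1, 2.4, 2.3, 2.6, 2.0, 2.7, 2.2, 2.3],
]
print("set P value:", perm_pvalue(expr, labels))
print("core gene indices:", reduce_set(expr, labels))
```

In this toy example a single gene drives the significance of the whole set, which is the situation the reduction step is meant to expose. With real microarray data, the number of permutations and the cutoff would need to be chosen with the sample size and multiple testing in mind.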

18 in total

1. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

2. Molecular sampling of prostate cancer: a dilemma for predicting disease progression.

Authors: Andrea Sboner; Francesca Demichelis; Stefano Calza; Yudi Pawitan; Sunita R Setlur; Yujin Hoshida; Sven Perner; Hans-Olov Adami; Katja Fall; Lorelei A Mucci; Philip W Kantoff; Meir Stampfer; Swen-Olof Andersson; Eberhard Varenhorst; Jan-Erik Johansson; Mark B Gerstein; Todd R Golub; Mark A Rubin; Ove Andrén
Journal: BMC Med Genomics Date: 2010-03-16 Impact factor: 3.063

Review 3. Gene-set approach for expression pattern analysis.

Authors: Dougu Nam; Seon-Young Kim
Journal: Brief Bioinform Date: 2008-01-17 Impact factor: 11.622

4. Molecular signatures database (MSigDB) 3.0.

Authors: Arthur Liberzon; Aravind Subramanian; Reid Pinchback; Helga Thorvaldsdóttir; Pablo Tamayo; Jill P Mesirov
Journal: Bioinformatics Date: 2011-05-05 Impact factor: 6.937

5. Gleason score 7 prostate cancer on needle biopsy: is the prognostic difference in Gleason scores 4 + 3 and 3 + 4 independent of the number of involved cores?

Authors: Danil V Makarov; Harriete Sanderson; Alan W Partin; Jonathan I Epstein
Journal: J Urol Date: 2002-06 Impact factor: 7.450

6. Delineation of prognostic biomarkers in prostate cancer.

Authors: S M Dhanasekaran; T R Barrette; D Ghosh; R Shah; S Varambally; K Kurachi; K J Pienta; M A Rubin; A M Chinnaiyan
Journal: Nature Date: 2001-08-23 Impact factor: 49.962

7. Identification of genes that function in the TNF-alpha-mediated apoptotic pathway using randomized hybrid ribozyme libraries.

Authors: Hiroaki Kawasaki; Reiko Onuki; Eigo Suyama; Kazunari Taira
Journal: Nat Biotechnol Date: 2002-04 Impact factor: 54.908

8. Gene expression profiling identifies clinically relevant subtypes of prostate cancer.

Authors: Jacques Lapointe; Chunde Li; John P Higgins; Matt van de Rijn; Eric Bair; Kelli Montgomery; Michelle Ferrari; Lars Egevad; Walter Rayford; Ulf Bergerheim; Peter Ekman; Angelo M DeMarzo; Robert Tibshirani; David Botstein; Patrick O Brown; James D Brooks; Jonathan R Pollack
Journal: Proc Natl Acad Sci U S A Date: 2004-01-07 Impact factor: 11.205

9. Natural history of early, localized prostate cancer.

10. Antitumor and antimetastatic activity of interleukin 12 against murine tumors.

1. In-Silico Integration Approach to Identify a Key miRNA Regulating a Gene Network in Aggressive Prostate Cancer.