Jonathan D Mosley, Ruth A Keri.
Abstract
BACKGROUND: Numerous gene lists or "classifiers" have been derived from global gene expression data that assign breast cancers to good and poor prognosis groups. A remarkable feature of these molecular signatures is that they have few genes in common, prompting speculation that they may use distinct genes to measure the same pathophysiological process(es), such as proliferation. However, this supposition has not been rigorously tested. If gene-based classifiers function by measuring a minimal number of cellular processes, we hypothesized that the informative genes for these processes could be identified and the data sets could be adjusted for the predictive contributions of those genes. Such adjustment would then attenuate the predictive function of any signature measuring that same process.
Year: 2008 PMID: 18439262 PMCID: PMC2396170 DOI: 10.1186/1755-8794-1-11
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Figure 1. Identification of predictive, correlated gene clusters in a simulated data set. A simulated expression data set that included 3 independent correlated gene clusters, two of which contained a gene associated with outcome (gene1 and gene2), as well as an additional set of uncorrelated probes was generated as described in the Methods. Each figure shows data for 901 simulated genes with a univariate hazard ratio (HR) p-value less than 0.5. Each graph is a scatter plot of the negative log of the p-value of the univariate HR for a gene versus its correlation to a principal component (PC) variable. The PC variable was derived from the expression values of the top 5 ranked genes representing the most predictive correlated gene cluster identified in the current iteration. A large value on the y-axis corresponds to a small p-value, indicating that a gene is strongly associated with outcome. A. The first correlated set of predictive genes identified on an analysis of unadjusted expression data. B. The second set of correlated genes identified after the expression data were adjusted for the first PC identified in graph (A). HR p-values were computed using the adjusted data. C. A third set of correlated genes was revealed after the data were sequentially adjusted for the PC variables identified in (A) and (B). Note that no additional correlated clusters of genes were identified that had small p-values, indicating that the 2 PC variables represented the two major clusters of genes predictive of outcome in the simulated data set, thereby confirming the efficacy of this approach.
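The adjustment step described in this caption can be illustrated with a minimal Python sketch. It assumes NumPy, simulates a single correlated cluster driven by a hypothetical latent variable, and omits the univariate Cox ranking step, so the five cluster genes are selected by construction rather than by p-value. The function names `first_pc_scores` and `adjust_for_covariate` are illustrative and do not come from the paper.

```python
import numpy as np

def first_pc_scores(expr):
    """Sample scores on the first principal component.

    expr: array of shape (n_samples, n_genes), e.g. the top-ranked genes.
    """
    centered = expr - expr.mean(axis=0)
    # SVD of the centered matrix; the first right singular vector holds the loadings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def adjust_for_covariate(expr, covariate):
    """Residuals from a simple linear regression of each gene on the covariate."""
    design = np.column_stack([np.ones(len(covariate)), covariate])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

def corr_to(expr, variable):
    """Pearson correlation of every gene (column) with one variable."""
    return np.array([np.corrcoef(expr[:, j], variable)[0, 1]
                     for j in range(expr.shape[1])])

# Simulated data (assumption): 5 genes share a latent process, 20 probes are noise.
rng = np.random.default_rng(0)
n_samples = 200
latent = rng.normal(size=n_samples)
cluster = latent[:, None] + 0.5 * rng.normal(size=(n_samples, 5))
noise = rng.normal(size=(n_samples, 20))
expr = np.hstack([cluster, noise])

pc = first_pc_scores(expr[:, :5])          # PC variable from the "top 5" genes
corr_before = corr_to(expr, pc)            # x-axis of the scatter plot, unadjusted
adjusted = adjust_for_covariate(expr, pc)  # data adjusted for the PC variable
corr_after = corr_to(adjusted, pc)         # correlations after adjustment
```

After adjustment, each gene's residual is uncorrelated with the PC variable, which is why a second pass over the adjusted data can only surface clusters that are independent of the first.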
Figure 2. Scatter plots showing the two most predictive clusters of correlated probes in three independent breast cancer gene expression data sets. Each graph is a scatter plot of the negative log of the p-value for a univariate HR versus the correlation of each probe to the first PC variable derived from the expression values of the top 10 ranked probes. For each data set, only probes with a univariate HR p-value of less than 0.05 for 5-year metastatic recurrence latencies were examined. A. Scatter plots for the 3,311 probes in the NKI2 dataset based on unadjusted data (left panel) and data adjusted for the first PC variable (right panel). B. Scatter plots for the 1,282 probes in the combined KJX64 and KJ125 datasets based on unadjusted data (left panel) and data adjusted for the first PC variable (right panel). C. Scatter plots for the 4,088 probes in the Wang dataset based on unadjusted data (left panel) and data adjusted for the first PC variable (right panel).
Multivariable Cox proportional hazards analysis for each principal component variable [1].
| Variable | All tumors: HR | All tumors: p | ER positive tumors: HR | ER positive tumors: p | ER negative tumors: HR | ER negative tumors: p |
| PC1 | 1.8 | 0.0001 | 2.8 | < 0.0001 | 0.9 | 0.61 |
| PC2 | 0.6 | < 0.0001 | 0.7 | 0.004 | 0.7 | 0.03 |
| PC3 | 1.2 | 0.01 | 2.1 | 0.0002 | 1.1 | 0.37 |
| Age | 0.9 | 0.0009 | 0.9 | 0.01 | 0.9 | 0.02 |
| Tumor size [2] | 1.8 | 0.01 | 2.2 | 0.009 | 2.4 | 0.04 |
| Grade 2 [3] | 2.0 | 0.13 | 1.1 | 0.87 | | |
| Grade 3 [3] | 2.2 | 0.09 | 1.3 | 0.55 | | |
| ER+ | 0.9 | 0.67 | | | | |
| PC1 | 1.5 | 0.004 | 2.2 | < 0.0001 | 0.9 | 0.74 |
| PC2 | 0.6 | 0.002 | 0.6 | 0.002 | 0.5 | 0.12 |
| Grade 2 [4] | 2.7 | 0.002 | 3.0 | 0.005 | 1.5 | 0.54 |
| Tumor size [2] | 3.2 | 0.002 | 4.4 | 0.002 | 1.8 | 0.42 |
| PC1 | 1.8 | < 0.0001 | 1.8 | < 0.0001 | 1.7 | 0.06 |
| PC2 | 1.6 | < 0.0001 | 1.7 | 0.0003 | 1.6 | 0.02 |
| PC3 | 2.0 | < 0.0001 | 2.2 | < 0.0001 | 1.5 | 0.05 |
| ER+ | 0.8 | 0.26 | | | | |
1. Up to three principal component variables (PC1-PC3), each derived from the top ranked probes for a set of correlated genes predictive of metastatic recurrence latencies, were included in the models. Hazard ratios for these PCs represent the change in hazards per standard deviation increase in the variable. A hazard ratio greater than 1 indicates that increasing levels of genes positively correlated with the PC variable are associated with an increased risk of metastasis.
2. Hazard ratios for tumor size compare tumors larger than 2 cm with tumors smaller than 2 cm.
3. Hazard ratios for grades are relative to grade 1 tumors. Grade variables were not included in the analysis of ER negative tumors due to the low numbers of tumors representing the various grades.
4. Hazard ratios are for grade 2 versus other grades.
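As a reading aid for the per-standard-deviation hazard ratios in this table, the sketch below shows how such a ratio scales under the usual Cox model assumption that the log hazard is linear in the covariate. The helper names are hypothetical, and the inputs 1.8 and 0.6 are simply the PC1 and PC2 values from the first row group for all tumors.

```python
import math

def hazard_ratio_for_change(hr_per_sd, n_sd):
    """Hazard ratio for a change of n_sd standard deviations.

    Assumes a Cox model, where HR = exp(beta) per unit change, so a change
    of n_sd units multiplies the hazard by exp(beta * n_sd).
    """
    beta = math.log(hr_per_sd)
    return math.exp(beta * n_sd)

def risk_direction(hr):
    """Interpret a hazard ratio relative to the neutral value of 1."""
    if hr > 1:
        return "increased"
    if hr < 1:
        return "decreased"
    return "unchanged"

# PC1 in all tumors: HR 1.8 per SD, so a 2 SD increase gives 1.8 squared.
hr_two_sd = hazard_ratio_for_change(1.8, 2)
# PC2 in all tumors: HR 0.6 per SD, i.e. a lower hazard as the PC increases.
pc2_direction = risk_direction(0.6)
```

The neutral value is 1 rather than 0 because the hazard ratio is the exponential of the model coefficient, and a coefficient of 0 corresponds to no effect.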
Figure 3. Genes correlated with the cell cycle principal component are overrepresented on published gene lists. A/B/C. Each bar graph shows the percentage of the genes within selected published prognostic gene lists that have Pearson's correlation coefficients greater than 0.4 or less than -0.4 to either the first (black bars) or second (grey bars) PC variables from each data set evaluated. The bars identified as "All probes" represent the percentage of all probes in the (A) NKI2 (n = 24,495), (B) KJX64/KJ125 (n = 22,285) and (C) Wang (n = 22,286) gene expression data sets that have correlations within the indicated ranges.
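The quantity plotted in these bar graphs, the percentage of genes whose Pearson correlation to a PC variable exceeds 0.4 in absolute value, can be computed as in the following sketch. It assumes NumPy and uses simulated data; the gene counts, the 70-gene index list, and the function names are hypothetical and stand in for a real expression matrix and a published signature.

```python
import numpy as np

def pearson_to_pc(expr, pc):
    """Pearson correlation of each gene (column of expr) with the PC variable."""
    x = expr - expr.mean(axis=0)
    y = pc - pc.mean()
    return (x.T @ y) / (np.linalg.norm(x, axis=0) * np.linalg.norm(y))

def percent_correlated(corr, gene_idx=None, threshold=0.4):
    """Percentage of genes with |r| above the threshold.

    gene_idx selects a gene list; None means all probes.
    """
    r = corr if gene_idx is None else corr[gene_idx]
    return 100.0 * np.mean(np.abs(r) > threshold)

# Simulated data (assumption): 50 of 1000 genes track the PC variable.
rng = np.random.default_rng(1)
n_samples, n_genes = 150, 1000
pc = rng.normal(size=n_samples)
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :50] += 1.5 * pc[:, None]

corr = pearson_to_pc(expr, pc)
gene_list = np.arange(70)                       # hypothetical signature
pct_list = percent_correlated(corr, gene_list)  # bar for the gene list
pct_all = percent_correlated(corr)              # "All probes" bar
```

Comparing `pct_list` with `pct_all` mirrors the comparison between each gene list's bar and the "All probes" bar: a list enriched for PC-correlated genes shows a much higher percentage than the background.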
Prognostic gene lists rely on genes correlated with the first (cell cycle) principal component variable [1].
| Gene list | Intercept only: Good/Poor | Intercept only: HR (p) | PC1 adjusted: Good/Poor | PC1 adjusted: HR (p) | PC2 adjusted: Good/Poor | PC2 adjusted: HR (p) |
| 70 Gene | 91/174 | | 30/235 | 1.2 (0.67) | 90/175 | |
| Wang 76-gene (ER+) | 85/60 | | 44/101 | 0.9 (0.84) | 89/56 | |
| Wang 76-gene (ER-) | 6/29 | | 5/30 | 1.2 (0.78) | 8/27 | 0.5 (0.25) |
| Wound Signature | 74/221 | | 27/268 | 1 (0.93) | 73/22 | |
| Sotiriou Grade | 92/203 | | 25/270 | 0.8 (0.8) | 91/204 | |
| Naderi | 131/104 | | 123/112 | 0.4 (0.003) [3] | 128/107 | |
| 70 Gene | 38/111 | | 15/134 | 0.7 (0.4) | 39/110 | |
| Wang 76-gene (ER+) | 53/36 | 1.6 (0.32) | 35/54 | 1.2 (0.7) | 61/28 | 1.8 (0.24) |
| Wang 76-gene (ER-) [4] | 14/5 | n/a | 16/3 | n/a | 12/7 | n/a |
| Sotiriou Grade | 46/143 | | 35/154 | 1.7 (0.24) | 46/143 | |
| Naderi | 71/56 | | 69/58 | 1.6 (0.24) | 72/55 | |
| 70 Gene | 49/197 | | 22/224 | 0.9 (0.41) | 44/202 | |
| Wang 76-gene (ER+) | 45/84 | | 30/99 | 2 (0.11) | 49/80 | |
| Wang 76-gene (ER-) | 26/16 | | 29/13 | | 15/27 | |
| Sotiriou Grade | 57/229 | | 21/265 | 0.7 (0.25) | 49/237 | |
| Naderi | 106/110 | | 105/111 | 0.9 (0.66) | 105/111 | |
1. Hazard ratios (HR) represent the change in risk for tumors classified as having a poor prognosis versus tumors classified as having a good prognosis using a univariate proportional hazards analysis. Gene expression data were independently adjusted using simple linear regression analysis for either an intercept only, the principal component representing the first correlated gene cluster identified (PC1) or the second correlated gene cluster identified (PC2) in each data set. Univariate HRs that are significantly (p < 0.05) greater than 1 are shown in bold.
2. Values in the Good/Poor column indicate the number of tumors assigned to the good and poor prognosis groups, respectively. The total number of tumors per gene list will vary depending upon the number of tumors used in training sets.
3. An HR significantly less than 1 indicates that the classifier did not assign tumors to the appropriate good and poor prognosis groups.
4. There were too few events among good and poor prognosis groups to perform statistical analyses in the KJX64/KJ125 data set.
Figure 4. The Recurrence Score prognosticator fails to perform appropriately after adjustment for the cell cycle principal component variable. A. A ROC curve summarizing the performance of the 21-gene recurrence classifier in ER positive tumors in the NKI2 data set (n = 224 subjects) adjusted for either an intercept only (black line) or the cell cycle PC variable (grey line). The area under the curve (AUC) for each analysis is shown on the graph. The dotted line represents a ROC curve with an AUC of 0.5 (a test that performs no better than chance). B/C. Kaplan-Meier plots for the recurrence score in the NKI2 data set for data adjusted for (B) intercept-only or (C) the cell cycle PC variable. Hazard ratios (HR) and p-values are from univariate proportional hazards analyses and represent the difference in hazards for poor versus good prognosis tumors. D. ROC curve for ER positive tumors in the KJX64/KJ125 data set (n = 147 subjects) adjusted for either an intercept only (black line) or the cell cycle PC variable (grey line).
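To make the AUC comparison in this caption concrete, the sketch below computes a ROC AUC from its rank interpretation and shows how a score can fall to chance level once the variable driving it is regressed out. It assumes NumPy and a binary outcome; the paper's handling of censored follow-up is not described here, so this is a simplification. The simulated `driver` variable is a hypothetical stand-in for the cell cycle PC variable.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. labels are truthy for events (e.g. recurrence).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Simulated data (assumption): the classifier score depends on the outcome
# only through a single driver variable.
rng = np.random.default_rng(2)
n = 300
labels = rng.random(n) < 0.3
driver = rng.normal(size=n) + 1.5 * labels
score = driver + 0.3 * rng.normal(size=n)

# Adjust the score for the driver with a simple linear regression (residuals).
slope, intercept = np.polyfit(driver, score, 1)
adjusted_score = score - (slope * driver + intercept)

auc_unadjusted = roc_auc(score, labels)
auc_adjusted = roc_auc(adjusted_score, labels)
```

In this toy setting `auc_unadjusted` is well above 0.5 while `auc_adjusted` sits near 0.5, the same qualitative pattern as the black and grey curves described in the caption.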