Predicting prognosis of breast cancer with gene signatures: are we lost in a sea of data?

Abstract

A large number of prognostic and predictive signatures have been proposed for breast cancer and a few of these are now available in the clinic as new molecular diagnostic tests. However, several other signatures have not fared well in validation studies. Some investigators continue to be puzzled by the diversity of signatures that are being developed for the same purpose but that share few or no common genes. The history of empirical development of prognostic gene signatures and the unique association between molecular subsets and clinical phenotypes of breast cancer explain many of these apparent contradictions in the literature. Three features of breast cancer gene expression contribute to this: the large number of individually prognostic genes (differentially expressed between good and bad prognosis cases); the unstable rankings of differentially expressed genes between datasets; and the highly correlated expression of informative genes.

Year: 2010 PMID: 21092148 PMCID: PMC3016623 DOI: 10.1186/gm202

Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117

Introduction

Gene-expression profiling allows simultaneous, semi-quantitative measurements of thousands of different mRNA species in a single experiment. It was considered logical to assume that different cancers will have distinct gene-expression patterns and that the expression of many genes will be associated with clinically relevant disease outcomes in particular cancer types. Consequently, it was assumed these associations might be exploited to develop a new generation of multi-gene diagnostic tests, in particular prognostic and treatment response predictors. It has quickly become apparent that cancers of different organs have very different gene-expression patterns; indeed, this fact led to the development of a novel gene-expression-based molecular diagnostic test to assign a histological origin to metastatic cancers that present as 'cancers of unknown primary' [1]. Gene-expression profiling results also prompted re-evaluation of disease classification for certain tumors, most prominently breast cancer. Breast cancer used to be considered as a single disease with variable histological appearance and variable expression of estrogen receptor (ER) and other molecular markers. Gene-expression profiling studies revealed surprisingly large-scale molecular differences between ER-positive and ER-negative cancers that suggested that these two different types of breast cancers are distinct diseases [2-4]. A new molecular classification schema was proposed, but how many molecular classes there are and what method is best to assign these classes continues to be debated [5]. Currently, there is no standard, readily available, gene-expression-based test to determine the molecular class of breast cancer in the clinic. Molecular classification emerged through unsupervised analysis of gene-expression data. The goal of this analysis is to identify disease subsets that show similar gene-expression patterns within a larger cohort of cases. 
During this analysis, the molecular subsets are defined without considering clinical outcome information. Consequently, the emerging molecular subsets may or may not differ in prognosis or response to various therapies. A parallel research effort has focused on developing supervised outcome predictors. This approach relies on comparing cases with known outcome (such as recurrence versus no recurrence). The goal of the analysis is to identify differentially expressed genes between outcome groups and use these genes to develop a multi-gene outcome predictor. Evaluation of the predictive accuracy of the supervised model requires independent validation cases. Investigators who developed the first generation of supervised prognostic and treatment response predictors started with the then prevailing notion that breast cancer is a single disease, and all subtypes of breast cancer were included in the analysis. This resulted in major limitations in the diagnostic products that emerged from this research [6,7].

The plethora of prognostic gene signatures for breast cancer

Unsupervised molecular classification identified three major and robust groups of breast cancers that differ in the expression of several hundred to a few thousand genes. These include basal-like breast cancers, which are negative for ER, progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2); low histological grade ER-positive breast cancers (also called luminal A); and high grade, highly proliferative ER-positive cancers (luminal B). Several smaller and less stable molecular subsets (such as normal-like, HER2-positive and claudin-low) have also been proposed but are less consistently seen and are distinguished by substantially smaller molecular differences [4,5]. Importantly, among the various molecular subsets, one group, the luminal A class that includes low grade ER-positive cancers, stands out with a very favorable prognosis with or without adjuvant endocrine therapy. The other groups have worse but rather similar prognosis [4,8]. If one understands these close associations between clinical phenotype, molecular class and prognosis, it is no longer surprising that comparing gene-expression profiles of breast cancers that recurred (mostly the ER-negative and the high grade, ER-positive cancers) and those that did not (low grade, ER-positive cancers) in the absence of any systemic therapy (or after anti-estrogen therapy alone in the case of ER-positive cancers) yields a very large number of differentially expressed genes. The relative position of individual genes in a rank-ordered gene list varies greatly, but the consistency of the gene list membership is fairly high across various datasets [9]. Functional annotation indicates that the majority of these prognostic genes are proliferation-related genes and the remainder are mostly ER-associated and, to a lesser extent, immune-related genes [10-12]. 
Because these genes function together in a coordinated manner in the regulation and execution of complex biological processes, such as cell proliferation, or originate from a particular cell type, such as immune cell infiltrate, many of these prognostic genes are also highly co-expressed with one another. It is therefore expected that a large number of nominally different prognostic signatures can be constructed that all perform equally well. For example, a particular gene may discriminate highly significantly in two datasets yet rank 5th among the most discriminating genes in one dataset (based on P-value or fold difference) and only 35th in the other (which is still very high considering the thousands of comparisons). In multivariate prediction model building, the top few informative features are usually combined and genes are added incrementally to increase the predictive performance. However, because many of the genes are highly correlated with each other, adding genes lower on the list yields less and less improvement in the model as a result of lack of independence. Therefore, the gene in question will be included in a predictor developed from the first dataset (because it is ranked 5th) and will work well on validation in the second dataset; but if a new predictor were to be developed from the second dataset, this gene may not be included in the predictor (because it is ranked 35th). These three features of the breast cancer prognostic gene space - the large number of individually prognostic features, the unstable rankings, and the highly correlated expression of informative genes - explain why it is easy to construct many different prognostic predictors that perform equally well even if they rely on nominally different genes in the model. However, this does not mean that all published prognostic gene signatures are equally ready for clinical use. 
Before adoption in the clinic, a molecular diagnostic assay has to be standardized, the reproducibility within and between laboratories and stability of results over time have to be demonstrated, and its predictive accuracy has to be validated in the right clinical context, preferably in multiple independent cohorts of patients. Most importantly, clinical utility implies that the assay improves clinical decision making and complements or replaces older standard methods, which in turn leads to better patient outcomes. Few published prognostic predictors have met these criteria [13,14].
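
The three features named in this section can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the studies cited: it assumes a single latent factor (standing in for proliferation) that drives both outcome and 100 correlated 'informative' genes, plus 400 pure-noise genes. All sample sizes, noise levels and function names are hypothetical. It derives a 10-gene signature from each of two training datasets and then scores both signatures on a third, independent dataset.

```python
import random
import statistics

def simulate(n_patients, n_informative, n_noise, rng):
    """Simulate one dataset: (expression rows, outcomes); 1 = recurrence."""
    rows, outcomes = [], []
    for _ in range(n_patients):
        latent = rng.gauss(0, 1)  # hypothetical latent factor, e.g. proliferation
        outcomes.append(1 if latent + rng.gauss(0, 0.5) > 0 else 0)
        # Informative genes all track the same latent factor, so they are
        # correlated with outcome and with one another.
        informative = [latent + rng.gauss(0, 1) for _ in range(n_informative)]
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        rows.append(informative + noise)
    return rows, outcomes

def rank_genes(rows, outcomes):
    """Order gene indices by mean expression difference (recurrence minus none)."""
    def mean_diff(g):
        bad = [r[g] for r, y in zip(rows, outcomes) if y == 1]
        good = [r[g] for r, y in zip(rows, outcomes) if y == 0]
        return statistics.fmean(bad) - statistics.fmean(good)
    return sorted(range(len(rows[0])), key=mean_diff, reverse=True)

def signature_scores(rows, genes):
    """Score each case as the unweighted mean of the signature genes."""
    return [statistics.fmean([r[g] for g in genes]) for r in rows]

def auc(scores, outcomes):
    """Chance that a recurrence case outscores a non-recurrence case."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
K = 10                  # assumed signature size
train_a = simulate(150, 100, 400, rng)
train_b = simulate(150, 100, 400, rng)
test_c = simulate(150, 100, 400, rng)

ranking_a = rank_genes(*train_a)
ranking_b = rank_genes(*train_b)
sig_a, sig_b = ranking_a[:K], ranking_b[:K]
shared = set(sig_a) & set(sig_b)

# Where does the gene ranked 5th in dataset A fall in dataset B?
rank_in_b = ranking_b.index(ranking_a[4]) + 1

auc_a = auc(signature_scores(test_c[0], sig_a), test_c[1])
auc_b = auc(signature_scores(test_c[0], sig_b), test_c[1])

print(f"genes shared by the two signatures: {len(shared)} of {K}")
print(f"gene ranked 5th in A is ranked {rank_in_b} in B")
print(f"AUC on independent data: signature A {auc_a:.2f}, signature B {auc_b:.2f}")
```

Under these assumptions the two signatures typically share few genes yet discriminate about equally well on the third dataset, because every selected gene is a noisy read-out of the same underlying factor.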

Why signatures work less well than expected

The predictive performance of a multivariate model largely depends on the number of independent informative genes included in the model, the magnitude of differential expression of the informative genes and the complexity of the background. Different clinical prediction problems show different degrees of difficulty. From the discussion above it should be apparent that prediction of ER status, histological grade of breast cancer, or better or worse prognosis associated with these clinical phenotypes should be relatively easy when considering all breast cancers together, and that such predictions can therefore yield predictors with good overall accuracy. Indeed, prognostic gene signatures developed for breast cancer in general or for ER-positive cancers tend to have good performance characteristics [12,15-17]. However, the first-generation prognostic signatures share some limitations. Because these were invariably developed by analyzing all subtypes of breast cancers together, they tend to assign a high-risk category to almost all ER-negative cancers (which are almost always high grade), even though a substantial majority of these cancers have good prognosis [18,19]. Similarly, the good- and poor-prognosis ER-positive cancers, as assigned by gene profiling, tend to correspond to the clinically low grade/low proliferation versus high grade/high proliferation subsets, respectively. This strong correlation between prognostic risk as predicted by gene signatures and routine clinical variables, such as histological grade, proliferation rate and ER status, limits the practical value of these tests. Efforts are under way to develop simple multivariate prognostic models that use routine pathological variables (such as ER, histological grade and HER2 status), and these could eventually rival the performance of the first-generation prognostic gene signatures [20,21]. 
However, standardizing the pathological assessment of breast cancer and reducing inter-observer variability remain important challenges. Predicting clinical outcome, such as prognosis or response to chemotherapy, within clinically and molecularly more homogeneous subsets (such as triple-negative breast cancers or high grade, ER-positive cancers) would be highly desirable. Unfortunately, these prediction problems seem to be more difficult [22,23]. It seems that fewer genes are associated with outcome in homogeneous disease subsets and the magnitude of association is modest when currently available datasets are analyzed. This leads to predictors that are specific to the particular dataset from which they were developed. These prediction models are fitted to the dataset and rely on features that have no or limited generalizability. This means that they fail to validate when applied to independent data or may demonstrate only nominally significant predictive value (that is, they may predict outcome slightly better than chance). Also, the discriminating value may not be substantial enough to be clinically useful [24,25]. For example, if the good-prognosis group has a recurrence rate of 30% compared with 50% in the poor-prognosis group, the difference may be statistically significant, but the risk of recurrence in the good-prognosis group is still too high to safely forgo adjuvant chemotherapy.
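
The arithmetic behind this example can be checked with a two-proportion z-test. The sketch below assumes 200 patients per group, which is a hypothetical cohort size (the text gives only the rates), and uses the usual normal approximation. The function name is illustrative.

```python
import math

def two_proportion_z_test(events_1, n_1, events_2, n_2):
    """Pooled two-proportion z-test; returns (z, two-sided P-value)."""
    p_1, p_2 = events_1 / n_1, events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_2 - p_1) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

N_PER_GROUP = 200        # assumption: not stated in the text
good_recurrences = 60    # 30% of 200 in the good-prognosis group
poor_recurrences = 100   # 50% of 200 in the poor-prognosis group

z, p_value = two_proportion_z_test(
    good_recurrences, N_PER_GROUP, poor_recurrences, N_PER_GROUP
)
good_risk = good_recurrences / N_PER_GROUP

print(f"z = {z:.2f}, two-sided P = {p_value:.1e}")
print(f"recurrence risk in the good-prognosis group: {good_risk:.0%}")
```

With these assumed group sizes the difference is highly significant, yet the good-prognosis group still carries a 30% recurrence risk. Statistical significance does not by itself establish clinical utility.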

Can we improve prediction through new technology platforms and improved bioinformatics tools?

It seems that for certain clinical prediction problems, the currently available breast cancer gene-expression datasets may not contain enough information to be able to develop highly accurate predictors [22,23]. This may reflect limitations of the sample sizes for the subsets of interest and, as more data become available, the empirically developed models may improve. However, it is also possible that major advances will need to take place in our understanding of how the 10,000 to 12,000 genes expressed in breast cancer interact before we can construct more accurate prediction models. Current statistical methods cannot readily adjust for different levels of gene-expression change that may be required for a functional effect. The level of expression change that results in a functional change may be different from gene to gene: for some genes a 15 to 20% increase in mRNA expression level may lead to functional consequences, whereas for others a 100 to 150% change may be needed. New bioinformatics approaches, such as examining the information content of the correlation matrix of gene-expression values or applying network analysis tools to the data, may also reveal additional prognostic information that is not readily revealed by studying gene-expression levels alone. New analytical platforms, such as next generation sequencing, will generate more comprehensive expression data than the current array-based methods and will also yield extensive nucleotide sequence information. The information content of these currently nascent datasets may be highly relevant to prognosis or treatment response of cancers and certainly warrants further exploration.

Conclusions

The predictive performance of multi-gene signatures depends on the number and robustness of informative genes that are associated with the outcome to be predicted. Some clinically important prediction problems are easier to solve than others. For example, it is possible to predict the prognosis of ER-positive breast cancers relatively accurately because prognosis is closely related to the proliferative status of these cancers and proliferation affects the expression of several hundreds of genes that regulate and execute cell division. Not surprisingly, several different models that use different genes and different algorithms can be built with each performing similarly. On the other hand, predicting response to individual drugs based on gene-expression signatures has proved substantially more difficult. Fewer genes are significantly associated with these outcomes, measured on current analytical platforms (gene-expression arrays), and therefore prediction models invariably contain substantial amounts of 'noise' (predictive features that are specific to the dataset, not the actual outcome) and have poorer predictive performance on independent datasets. Larger datasets and new analytical platforms (such as next generation sequencing) that broaden the portfolio of variables that can be used for model building are expected to lead to improved predictors for these currently difficult classification problems.
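
The 'noise' problem described above can be reproduced in an extreme, hypothetical case in which no gene carries any signal at all. In the sketch below (all sizes and names are assumptions, not drawn from the cited studies), a 10-gene signature is selected from 1,000 random genes in 60 cases. It is then scored on the cases it was selected from and on 200 independent cases.

```python
import random
import statistics

def random_dataset(n_cases, n_genes, rng):
    """Expression and outcome are unrelated: no gene is truly predictive."""
    rows = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_cases)]
    outcomes = [i % 2 for i in range(n_cases)]  # balanced labels, 1 = responder
    return rows, outcomes

def select_top_genes(rows, outcomes, k):
    """Pick the k genes with the largest mean difference between outcome groups."""
    def mean_diff(g):
        pos = [r[g] for r, y in zip(rows, outcomes) if y == 1]
        neg = [r[g] for r, y in zip(rows, outcomes) if y == 0]
        return statistics.fmean(pos) - statistics.fmean(neg)
    return sorted(range(len(rows[0])), key=mean_diff, reverse=True)[:k]

def signature_scores(rows, genes):
    """Score each case as the unweighted mean of the signature genes."""
    return [statistics.fmean([r[g] for g in genes]) for r in rows]

def auc(scores, outcomes):
    """Chance that a case with outcome 1 outscores a case with outcome 0."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(1)  # fixed seed so the sketch is repeatable
train_rows, train_y = random_dataset(60, 1000, rng)
test_rows, test_y = random_dataset(200, 1000, rng)

signature = select_top_genes(train_rows, train_y, 10)

# Performance on the cases used for gene selection (optimistically biased).
apparent_auc = auc(signature_scores(train_rows, signature), train_y)
# Performance on cases that played no part in gene selection.
independent_auc = auc(signature_scores(test_rows, signature), test_y)

print(f"apparent AUC on training cases:  {apparent_auc:.2f}")
print(f"AUC on independent cases:        {independent_auc:.2f}")
```

Because the genes were chosen for how well they happened to separate the training cases, the apparent performance is inflated, while performance on independent cases should stay near chance (an AUC of about 0.5). Real datasets with weak signal fall between this extreme and the easier prognostic setting.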

Abbreviations

ER: estrogen receptor; HER2: human epidermal growth factor receptor 2; PR: progesterone receptor.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TI drafted the manuscript; LP reviewed and revised the manuscript. Both authors read and approved the final version of the article.

23 in total

1. Molecular profiling of carcinoma of unknown primary and correlation with clinical evaluation.

Authors: Gauri R Varadhachary; Dmitri Talantov; Martin N Raber; Christina Meng; Kenneth R Hess; Tim Jatkoe; Renato Lenzi; David R Spigel; Yixin Wang; F Anthony Greco; James L Abbruzzese; John D Hainsworth
Journal: J Clin Oncol Date: 2008-09-20 Impact factor: 44.544

2. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models.

Authors: Leming Shi; Gregory Campbell; Wendell D Jones; Fabien Campagne; Zhining Wen; Stephen J Walker; Zhenqiang Su; Tzu-Ming Chu; Federico M Goodsaid; Lajos Pusztai; John D Shaughnessy; André Oberthuer; Russell S Thomas; Richard S Paules; Mark Fielden; Bart Barlogie; Weijie Chen; Pan Du; Matthias Fischer; Cesare Furlanello; Brandon D Gallas; Xijin Ge; Dalila B Megherbi; W Fraser Symmans; May D Wang; John Zhang; Hans Bitter; Benedikt Brors; Pierre R Bushel; Max Bylesjo; Minjun Chen; Jie Cheng; Jing Cheng; Jeff Chou; Timothy S Davison; Mauro Delorenzi; Youping Deng; Viswanath Devanarayan; David J Dix; Joaquin Dopazo; Kevin C Dorff; Fathi Elloumi; Jianqing Fan; Shicai Fan; Xiaohui Fan; Hong Fang; Nina Gonzaludo; Kenneth R Hess; Huixiao Hong; Jun Huan; Rafael A Irizarry; Richard Judson; Dilafruz Juraeva; Samir Lababidi; Christophe G Lambert; Li Li; Yanen Li; Zhen Li; Simon M Lin; Guozhen Liu; Edward K Lobenhofer; Jun Luo; Wen Luo; Matthew N McCall; Yuri Nikolsky; Gene A Pennello; Roger G Perkins; Reena Philip; Vlad Popovici; Nathan D Price; Feng Qian; Andreas Scherer; Tieliu Shi; Weiwei Shi; Jaeyun Sung; Danielle Thierry-Mieg; Jean Thierry-Mieg; Venkata Thodima; Johan Trygg; Lakshmi Vishnuvajjala; Sue Jane Wang; Jianping Wu; Yichao Wu; Qian Xie; Waleed A Yousef; Liang Zhang; Xuegong Zhang; Sheng Zhong; Yiming Zhou; Sheng Zhu; Dhivya Arasappan; Wenjun Bao; Anne Bergstrom Lucas; Frank Berthold; Richard J Brennan; Andreas Buness; Jennifer G Catalano; Chang Chang; Rong Chen; Yiyu Cheng; Jian Cui; Wendy Czika; Francesca Demichelis; Xutao Deng; Damir Dosymbekov; Roland Eils; Yang Feng; Jennifer Fostel; Stephanie Fulmer-Smentek; James C Fuscoe; Laurent Gatto; Weigong Ge; Darlene R Goldstein; Li Guo; Donald N Halbert; Jing Han; Stephen C Harris; Christos Hatzis; Damir Herman; Jianping Huang; Roderick V Jensen; Rui Jiang; Charles D Johnson; Giuseppe Jurman; Yvonne Kahlert; Sadik A Khuder; Matthias Kohl; Jianying Li; Li Li; Menglong Li; Quan-Zhen Li; Shao Li; Zhiguang Li; Jie 
Liu; Ying Liu; Zhichao Liu; Lu Meng; Manuel Madera; Francisco Martinez-Murillo; Ignacio Medina; Joseph Meehan; Kelci Miclaus; Richard A Moffitt; David Montaner; Piali Mukherjee; George J Mulligan; Padraic Neville; Tatiana Nikolskaya; Baitang Ning; Grier P Page; Joel Parker; R Mitchell Parry; Xuejun Peng; Ron L Peterson; John H Phan; Brian Quanz; Yi Ren; Samantha Riccadonna; Alan H Roter; Frank W Samuelson; Martin M Schumacher; Joseph D Shambaugh; Qiang Shi; Richard Shippy; Shengzhu Si; Aaron Smalter; Christos Sotiriou; Mat Soukup; Frank Staedtler; Guido Steiner; Todd H Stokes; Qinglan Sun; Pei-Yi Tan; Rong Tang; Zivana Tezak; Brett Thorn; Marina Tsyganova; Yaron Turpaz; Silvia C Vega; Roberto Visintainer; Juergen von Frese; Charles Wang; Eric Wang; Junwei Wang; Wei Wang; Frank Westermann; James C Willey; Matthew Woods; Shujian Wu; Nianqing Xiao; Joshua Xu; Lei Xu; Lun Yang; Xiao Zeng; Jialu Zhang; Li Zhang; Min Zhang; Chen Zhao; Raj K Puri; Uwe Scherf; Weida Tong; Russell D Wolfinger
Journal: Nat Biotechnol Date: 2010-07-30 Impact factor: 54.908

3. Assessment of an RNA interference screen-derived mitotic and ceramide pathway metagene as a predictor of response to neoadjuvant paclitaxel for primary triple-negative breast cancer: a retrospective analysis of five clinical trials.

Authors: Nicolai Juul; Zoltan Szallasi; Aron C Eklund; Qiyuan Li; Rebecca A Burrell; Marco Gerlinger; Vicente Valero; Eleni Andreopoulou; Francisco J Esteva; W Fraser Symmans; Christine Desmedt; Benjamin Haibe-Kains; Christos Sotiriou; Lajos Pusztai; Charles Swanton
Journal: Lancet Oncol Date: 2010-02-26 Impact factor: 41.316

4. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

5. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns.

Authors: S Gruvberger; M Ringnér; Y Chen; S Panavally; L H Saal; M Fernö; C Peterson; P S Meltzer
Journal: Cancer Res Date: 2001-08-15 Impact factor: 12.701

Review 6. American Society of Clinical Oncology 2007 update of recommendations for the use of tumor markers in breast cancer.

Authors: Lyndsay Harris; Herbert Fritsche; Robert Mennel; Larry Norton; Peter Ravdin; Sheila Taube; Mark R Somerfield; Daniel F Hayes; Robert C Bast
Journal: J Clin Oncol Date: 2007-10-22 Impact factor: 44.544

7. Gene expression profiles obtained from fine-needle aspirations of breast cancer reliably identify routine prognostic markers and reveal large-scale molecular differences between estrogen-negative and estrogen-positive tumors.

Authors: Lajos Pusztai; Mark Ayers; James Stec; Edward Clark; Kenneth Hess; David Stivers; Andrew Damokosh; Nour Sneige; Thomas A Buchholz; Francisco J Esteva; Banu Arun; Massimo Cristofanilli; Daniel Booser; Marguerite Rosales; Vicente Valero; Constantine Adams; Gabriel N Hortobagyi; W Fraser Symmans
Journal: Clin Cancer Res Date: 2003-07 Impact factor: 12.531

8. Supervised risk predictor of breast cancer based on intrinsic subtypes.

Authors: Joel S Parker; Michael Mullins; Maggie C U Cheang; Samuel Leung; David Voduc; Tammi Vickery; Sherri Davies; Christiane Fauron; Xiaping He; Zhiyuan Hu; John F Quackenbush; Inge J Stijleman; Juan Palazzo; J S Marron; Andrew B Nobel; Elaine Mardis; Torsten O Nielsen; Matthew J Ellis; Charles M Perou; Philip S Bernard
Journal: J Clin Oncol Date: 2009-02-09 Impact factor: 44.544

9. Use of 70-gene signature to predict prognosis of patients with node-negative breast cancer: a prospective community-based feasibility study (RASTER).

Authors: Jolien M Bueno-de-Mesquita; Wim H van Harten; Valesca P Retel; Laura J van 't Veer; Frits Sam van Dam; Kim Karsenberg; Kirsten Fl Douma; Harm van Tinteren; Johannes L Peterse; Jelle Wesseling; Tin S Wu; Douwe Atsma; Emiel Jt Rutgers; Guido Brink; Arno N Floore; Annuska M Glas; Rudi Mh Roumen; Frank E Bellot; Cees van Krimpen; Sjoerd Rodenhuis; Marc J van de Vijver; Sabine C Linn
Journal: Lancet Oncol Date: 2007-11-26 Impact factor: 41.316

10. Thresholds for therapies: highlights of the St Gallen International Expert Consensus on the primary therapy of early breast cancer 2009.

Authors: A Goldhirsch; J N Ingle; R D Gelber; A S Coates; B Thürlimann; H-J Senn
Journal: Ann Oncol Date: 2009-06-17 Impact factor: 32.976
