
A refined molecular taxonomy of breast cancer.

M Guedj, L Marisa, A de Reynies, B Orsetti, R Schiappa, F Bibeau, G MacGrogan, F Lerebours, P Finetti, M Longy, P Bertheau, F Bertrand, F Bonnet, A L Martin, J P Feugeas, I Bièche, J Lehmann-Che, R Lidereau, D Birnbaum, F Bertucci, H de Thé, C Theillet.

Abstract

The current histoclinical breast cancer classification is simple but imprecise. Several molecular classifications of breast cancers based on expression profiling have been proposed as alternatives. However, their reliability and clinical utility have been repeatedly questioned, notably because most of them were derived from relatively small initial patient populations. We analyzed the transcriptomes of 537 breast tumors using three unsupervised classification methods. A core subset of 355 tumors was assigned to six clusters by all three methods. These six subgroups overlapped with previously defined molecular classes of breast cancer, but also showed important differences, notably the absence of an ERBB2 subgroup and the division of the large luminal ER+ group into four subgroups, two of them being highly proliferative. Of the six subgroups, four were ER+/PR+/AR+, one was ER-/PR-/AR+ and one was triple negative (AR-/ER-/PR-). ERBB2-amplified tumors were split between the ER-/PR-/AR+ subgroup and the highly proliferative ER+ LumC subgroup. Importantly, each of these six molecular subgroups showed specific copy-number alterations. Gene expression changes were correlated to specific signaling pathways. Each of these six subgroups showed very significant differences in tumor grade, metastatic sites, relapse-free survival or response to chemotherapy. All these findings were validated on large external datasets including more than 3000 tumors. Our data thus indicate that these six molecular subgroups represent well-defined clinico-biological entities of breast cancer. Their identification should facilitate the detection of novel prognostic factors or therapeutical targets in breast cancer.

Substances:
Biomarkers, Tumor

Year: 2011 PMID: 21785460 PMCID: PMC3307061 DOI: 10.1038/onc.2011.301

Source DB: PubMed Journal: Oncogene ISSN: 0950-9232 Impact factor: 9.867

Introduction

Breast cancer is heterogeneous. Biological features have proven insufficient for a comprehensive description of the disease. Seminal work by Sorlie et al. (2003) delineated five major molecular subtypes of breast cancer associated with different outcomes. This initial classification was reproduced in independent datasets (Bertucci ), strongly suggesting the existence of distinct molecular entities in breast cancer. The Sorlie centroid approach has subsequently been redefined and adapted to more recent technological platforms (Hu ; Parker ). However, critics have pointed to the instability of the defined subtypes (Kapp ; Weigelt ) and their dependence on the original set of samples or genes. Thus, although molecular classification brings interesting insights into breast cancer taxonomy, its implementation in the clinic is in doubt because of insufficient reliability in single sample allocation (Weigelt ). Instead, three broad classes of breast tumors, drawn according to their ER, PR and ERBB2/HER2 status, are commonly used in the clinic. ER−/PR−/HER2− tumors are defined as triple negative, ER+/PR+/HER2− as luminal, and HER2+ tumors, irrespective of their ER status, form the third class (Foulkes ). However, this simple classification is also criticized because of the biological heterogeneity within classes. In particular, the correspondence between the triple-negative group and basal-like breast tumors and the heterogeneity of the large ER+/PR+ group have been repeatedly questioned (Gusterson, 2009; Foulkes ). This argues for a more elaborate stratification amenable to biological exploration and clinical choices, and prompted us to construct a robust molecular classification on a large number of samples to reach high statistical power. To this aim, we produced transcriptomes of a series of 537 primary breast cancers and, using a semi-supervised analysis, revealed six stable molecular subgroups. A related classification rule was defined.
Each of the six molecular subgroups showed distinct genomic changes, correlated with a specific set of signaling pathways and was associated with significant differences in tumor grade, metastatic sites and metastasis-free survival (MFS). We propose that this classification scheme could lay the basis for an operational tool to reliably classify breast cancers into more homogeneous molecular subgroups. This classification could be highly beneficial in future investigations aiming at identifying novel prognostic factors or therapeutic targets in breast cancer.

Results

Semi-supervised gene expression analysis identifies six prototypic molecular subtypes

Our aim was to identify molecular subgroups representing homogeneous subsets of breast cancer. Our methodology is detailed in Supplementary Figure 1 and the Supplementary Methods section. Briefly, we produced a large dataset comprising 537 primary breast cancer transcriptomes on Affymetrix U133-Plus2.0 arrays to ensure proper statistical power. First, this tumor set was classified with three unsupervised methods (hierarchical clustering, Gaussian mixture models and k-means) in parallel. Of the 537 tumors, 355 yielded a consensus subgroup assignment (that is, were assigned to the same subclass) between all three methods. This subset was named coreset and was used for further analysis. Second, a minimal list of 256 discriminative genes with maximal intragroup homogeneity and intergroup heterogeneity was generated by analysis of variance (Supplementary Table 3). Hierarchical clustering based on this list delineated six homogeneous tumor subgroups, homogeneity being confirmed by the principal component analysis (Figure 1b). To allow the classification of independent sample profiles to one of the six subgroups we built a single sample predictor based on a distance-to-centroid approach (using the previously mentioned 256 genes; Supplementary Methods). The 182 tumors of the discovery set lying outside of the coreset were classified using this single sample predictor.
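
The two steps described above (keeping only tumors on which all three methods agree, then assigning new profiles to the nearest subgroup centroid) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that cluster labels have already been matched across methods, it uses plain Euclidean distance (the actual distance and gene weighting are given in the Supplementary Methods and are not reproduced here), and the function names and two-gene toy profiles are hypothetical.

```python
from math import sqrt


def consensus_coreset(labels_a, labels_b, labels_c):
    """Indices of samples given the same label by all three methods.

    Assumes labels were already matched across methods, since raw
    cluster IDs produced by different algorithms are arbitrary.
    """
    triples = zip(labels_a, labels_b, labels_c)
    return [i for i, (a, b, c) in enumerate(triples) if a == b == c]


def centroids(profiles, labels):
    """Mean expression profile of each subgroup."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for j, value in enumerate(profile):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(profile, cents):
    """Assign a profile to the subgroup with the nearest centroid."""
    def dist(u, v):
        # Euclidean distance: an assumption made for this sketch only.
        return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return min(cents, key=lambda lab: dist(profile, cents[lab]))


# Toy data: five tumors, two genes, labels from three hypothetical runs.
profiles = [[5.0, 1.0], [5.2, 0.8], [1.0, 4.0], [0.8, 4.2], [3.0, 3.0]]
hc = ["LumA", "LumA", "BasL", "BasL", "LumA"]   # hierarchical clustering
gmm = ["LumA", "LumA", "BasL", "BasL", "BasL"]  # Gaussian mixture model
km = ["LumA", "LumA", "BasL", "BasL", "LumA"]   # k-means

core = consensus_coreset(hc, gmm, km)           # tumor 4 is discordant
cents = centroids([profiles[i] for i in core], [hc[i] for i in core])
predicted = predict(profiles[4], cents)         # classify the left-out tumor
```

In this toy case the discordant tumor is excluded from centroid estimation and then classified by the predictor, mirroring how the 182 non-coreset tumors were handled.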

Figure 1

Breast tumor classification according to the CIT classification into six subgroups of tumors. (a) Heatmap representing the expression of the 256 genes (nine clusters of genes represented by vertical color bars on the left of the heatmap) through the six groups. (b) Principal component analysis (PCA) of the samples of the coreset according to the 256 gene signature. The first principal component (PC1) represents the combined expression of the three transversal clusters (ER, AR and cell cycle), the second component (PC2) differentiates LumB and NormL. (c) Distribution of mean expression levels of the three transversal gene clusters (ER, AR and Cell Cycle) over the six main molecular subgroups. (d) Comparison of the CIT classification with those obtained using the Sorlie, Hu, Parker and Jönsson systems.

The overall distribution of the six subgroups was determined by three large gene clusters shared by at least two subgroups. The first one (cluster-VI, Figures 1a and c, Supplementary Table 3) containing ESR1 and correlated genes, defined two ER-negative (ER−) and four ER-positive (ER+) subgroups (Figures 1a and c). The second gene cluster (cluster-IV) included the androgen receptor (AR) gene and encompassed five subgroups. Of the six subgroups, four were ER+/PR+/AR+, one was ER−/PR−/AR+ and one was triple negative (AR−/ER−/PR− Figures 1a and c). Interestingly, cluster-IV included transcription factors FOXA1, SPDEF and XBP1, which are usually associated to the ER-cluster (Bertucci ). The third cluster (cluster-II) was predominantly composed of genes regulating DNA replication and cell cycle progression, thus defining elevated cell proliferation. This cluster encompassed both ER− and two ER+ subgroups (Figures 1a and c). Each subgroup was defined by a specific gene cluster (Supplementary Figure 2) in which we found genes previously part of the Sorlie centroids. Hence, for simplicity we named our subgroups according to the Sorlie subtype (Sorlie ). ER+ subgroups were split according to expression levels of the cell cycle cluster. Low proliferative ER+ subgroups were differentiated by clusters-III and IX (Figure 1a, Supplementary Figure 2), comprising respectively genes from the Sorlie luminal-A and normal-like centroids (Supplementary Table 3) and were, thus, designated LumA and NormL. The two high proliferation ER+ subgroups differed sharply in ER-cluster expression levels. The subgroup expressing highest levels of ER was named LumB. The other subgroup, positioned at the boundary between ER+ and ER− tumors, was designated LumC (Figures 1a–c). Noteworthy, 40% of LumC tumors overexpressed the ERBB2/HER2 gene. Next was the AR+/ER−/PR− subgroup (Figure 1b), defined by cluster-VIII. 
The AR+/ER− status of this subgroup was reminiscent of the previously described ‘molecular-apocrine' subtype (Farmer ) and we designated it mApo. Although ERBB2/HER2 was overexpressed by 72% of the tumors in this subgroup, cluster-VIII did not comprise genes co-amplified with ERBB2/HER2. In fact, ERBB2/HER2+ tumors distributed in mApo and LumC subgroups (Table 1). Finally, the AR−/ER−/PR− subgroup, defined by cluster-I, presented the greatest distance to all others (Figure 1). As it shared genes with the ‘basal-like' subtype, it was designated BasL (Supplementary Table 3).

Table 1

Molecular subgroups show differential correlation to breast cancer clinico-biological parameters and different sites of metastatic relapse

CIT classification
Variable	P-value	BasL	mApo	LumC	LumB	LumA	NormL
Total		53	39	48	66	61	88

ER+ _(IHC)	1.00E–50	5 (10%)	1 (3%)	37 (84%)	63 (98%)	58 (97%)	81 (93%)
ER− _(IHC)		46 (90%)	35 (97%)	7 (16%)	1 (2%)	2 (3%)	6 (7%)
ER+ _(EXP)	6.00E–68	3 (6%)	2 (5%)	48 (100%)	66 (100%)	61 (100%)	87 (99%)
ER− _(EXP)		50 (94%)	37 (95%)	0 (0%)	0 (0%)	0 (0%)	1 (1%)
PR+ _(IHC)	2.00E–25	4 (8%)	1 (3%)	25 (54%)	43 (67%)	53 (88%)	62 (71%)
PR− _(IHC)		48 (92%)	34 (97%)	21 (46%)	21 (33%)	7 (12%)	25 (29%)
PR+ _(EXP)	1.00E–37	5 (9%)	5 (13%)	32 (67%)	47 (71%)	58 (95%)	85 (97%)
PR− _(EXP)		48 (91%)	34 (87%)	16 (33%)	19 (29%)	3 (5%)	3 (3%)
ERBB2+ _(IHC)	9.00E–19	3 (7%)	19 (68%)	10 (26%)	5 (11%)	0 (0%)	0 (0%)
ERBB2− _(IHC)		43 (93%)	9 (32%)	28 (74%)	41 (89%)	37 (100%)	74 (100%)
ERBB2+ _(EXP)	4.00E–31	2 (4%)	29 (74%)	20 (42%)	2 (3%)	0 (0%)	5 (6%)
ERBB2− _(EXP)		51 (96%)	10 (26%)	28 (58%)	64 (97%)	61 (100%)	83 (94%)
AR+ _(EXP)	2.00E–57	2 (4%)	32 (82%)	47 (98%)	63 (95%)	61 (100%)	88 (100%)
AR− _(EXP)		51 (96%)	7 (18%)	1 (2%)	3 (5%)	0 (0%)	0 (0%)
P53mut	1.00E–15	29 (83%)	13 (72%)	24 (69%)	5 (16%)	1 (4%)	1 (5%)
P53wt		6 (17%)	5 (28%)	11 (31%)	27 (84%)	27 (96%)	21 (95%)
Ductal	0.05	51 (98%)	32 (84%)	39 (87%)	54 (84%)	50 (83%)	61 (77%)
Lobular	0.004	1 (2%)	1 (3%)	3 (7%)	3 (5%)	5 (8%)	15 (19%)
Other	0.1	0 (0%)	5 (13%)	3 (7%)	7 (11%)	5 (8%)	3 (4%)
SBR Grade 1	8.00E–11	0 (0%)	0 (0%)	0 (0%)	0 (0%)	7 (12%)	23 (27%)
SBR Grade 2	2.00E–13	6 (11%)	8 (21%)	21 (47%)	38 (58%)	44 (77%)	53 (62%)
SBR Grade 3	4.00E–26	47 (89%)	30 (79%)	24 (53%)	28 (42%)	6 (11%)	9 (11%)
Age (median)	4.00E–07	50	56	54	57	62	52

MR 5year	0.001	17 (36%)	14 (38%)	11 (34%)	15 (26%)	9 (20%)	6 (8%)
MR 15year	0.01	17 (36%)	14 (38%)	13 (41%)	18 (32%)	10 (22%)	11 (15%)
Bones	0.01	4 (24%)	8 (57%)	7 (54%)	14 (78%)	7 (70%)	9 (82%)
Brain	0.06	5 (29%)	3 (21%)	1 (8%)	0 (0%)	0 (0%)	2 (18%)
Liver	0.7	5 (29%)	6 (43%)	7 (54%)	8 (44%)	3 (30%)	3 (27%)
Lung	0.9	6 (35%)	4 (29%)	6 (46%)	8 (44%)	3 (30%)	4 (36%)
Other	0.1	4 (24%)	1 (7%)	7 (54%)	8 (44%)	3 (30%)	3 (27%)

Abbreviations: CIT, Cartes d'Identité des Tumeurs program; MR, metastasis relapse.

Expression of ER, PR and ERBB2/HER2 was determined by immunohistochemistry as well as by RNA expression (for greater details see Supplementary Methods). TP53 mutation status was determined by the yeast functional assay (Supplementary Methods). P-values for qualitative variables (ER, PR, ERBB2/HER2, TP53 mutation, histological type, SBR grading) result from a Fisher exact test. P-values for quantitative variables (median age) result from an analysis of variance. MR was determined 5 and 15 years after surgery. Frequency of MR in a subgroup was calculated as the ratio of MR with the total number of MR. For each subgroup, the percentage of MR at a given site is the number of MR at this site divided by the total number of MR in the subgroup. MR may occur at more than one site; hence, the sum of percentages may not equal 100.
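
To illustrate the kind of test named in this footnote, the sketch below computes a two-sided Fisher exact P-value for a 2×2 table using only the standard library. It is a simplification under stated assumptions: Table 1 compares six subgroups at once, whereas this example collapses the histological-type rows into "NormL versus all other subgroups" and "lobular versus non-lobular", so the result is not expected to match the P-value printed in the table. The function name is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Counts collapsed from the histological-type rows of Table 1:
# NormL: 15 lobular, 64 ductal or other; remaining subgroups: 13 and 246.
p_lobular = fisher_exact_2x2(15, 64, 13, 246)
share_in_norml = 15 / (15 + 13)  # fraction of all lobular tumors in NormL
```

The last line recovers the statement in the text that NormL contains about 53% of all lobular cancers in the dataset.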

Molecular subgroups show distinct clinical correlations, metastatic sites and outcomes

BasL and mApo at one end of the spectrum, and LumA and NormL at the other end, showed an inverse relationship between high grade and ER/PR positivity (Table 1). TP53 mutation incidence reached 83% in the BasL subgroup and gradually decreased to 4% in LumA and 5% in NormL tumors (Table 1). This distribution of high-grade/ER− versus low-grade/ER+ cancers was also consistent with the median age of onset: 50 and 62 for BasL and LumA patients, respectively. A correlation with histological type was observed as well. While the BasL subgroup was composed of 98% ductal carcinomas, 19% of NormL tumors were invasive lobular tumors, representing 53% of all lobular cancers in the dataset, consistent with previous findings (Bertucci ). Molecular subgroups showed differences in sites of metastatic relapse. In line with previous studies (Smid ), LumA and NormL predominantly metastasized to the bone and rarely or never to the brain, whereas BasL and mApo tumors metastasized more often to the brain and less often to the bones (Table 1). ST6GALNAC5, COX2/PTGS2 and HBEGF, whose expression has recently been associated with brain metastasis (Bos ), were increased in BasL (Supplementary Figure 3). Clear differences were also found in MFS (Figure 2). BasL and mApo subgroups showed the earliest recurrences (18 to 60 months). LumA and NormL had the slowest course. Metastatic recurrence plateaued between 60 and 180 months in BasL and mApo, whereas it progressively increased after 60 months in ER+ subgroups. LumA and NormL tumors presented recurrences more than 120 months after surgery. Interestingly, patterns of recurrence (early versus late) matched cell cycle cluster expression levels in the different subgroups.

Figure 2

Breast cancer molecular subgroups show distinctly different disease outcome. Kaplan–Meier curves shown in this figure represent disease-free survival with metastatic relapse as an end point. (a, b) show survival curves in the CIT and validation set, respectively. Abrupt breaks in some curves of (a) are related to small numbers of patients with long-term follow-up in these subgroups. These appear smoothed out in (b) because of greater numbers in the validation set.

Performance on external datasets

We applied our classification scheme to a large Affymetrix dataset comprising 2291 breast cancer transcriptomes that we collected from the literature (Supplementary Methods). The six molecular subgroups were perfectly reproduced, both in terms of distribution and of clinical correlations and outcomes (Supplementary Table 4a, Figure 2b). To further ascertain its robustness, we tested our classification on three expression datasets from different technological platforms (Swegene, Qiagen/Operon, Eurofins MWG Operon, Roissy, France; and Agilent, Santa Clara, CA, USA). Because our prediction rule was designed for Affymetrix datasets, we had to adapt it to different technological contexts (Supplementary Methods). Overall, molecular subgroups were reproduced on different platforms (Supplementary Table 4b and Supplementary Figure 4). Differences were noted according to the dataset, which may be due to different tumor recruitment in each series. To test inter-platform reproducibility, we classified the GSE3155 dataset, which was analyzed in parallel on two dual-color platforms (Agilent and Stanford University, Palo Alto, CA, USA) and one uni-color platform (Applied Biosystems, Carlsbad, CA, USA) (Supplementary Table 4c). Classification on both dual-color datasets showed a 90% overlap, suggesting good inter-platform reproducibility. However, overlap dropped dramatically when dual and uni-color platforms were compared (48 and 52%). This indicates that classification rules need adaptation to the technological specificities of each platform to perform optimally.
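
As an illustration of the inter-platform comparison, the sketch below computes the overlap between two sets of subgroup calls made on the same tumors. It assumes that "overlap" means simple percent agreement, which this section does not state explicitly; the function name and the ten toy calls are hypothetical and are not taken from GSE3155.

```python
def overlap(calls_a, calls_b):
    """Fraction of tumors assigned to the same subgroup on both platforms."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both platforms must classify the same tumors")
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return same / len(calls_a)


# Hypothetical calls for ten tumors profiled on two dual-color platforms.
platform_1 = ["BasL", "BasL", "mApo", "LumC", "LumB",
              "LumB", "LumA", "LumA", "NormL", "NormL"]
platform_2 = ["BasL", "BasL", "mApo", "LumC", "LumB",
              "LumC", "LumA", "LumA", "NormL", "NormL"]

agreement = overlap(platform_1, platform_2)  # one discordant call out of ten
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would be a natural alternative if subgroup sizes are very unbalanced.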

Comparison with other molecular classifiers

We next compared our classification with the Sorlie, Hu and Parker centroids (Sorlie ; Hu ; Parker ). Variable overlaps were found for BasL, LumB, LumA and NormL subgroups (Figure 1d). However, significant differences were noted for the mApo and LumC subgroups, which not only overlapped at variable levels with the ERBB2 subtype, but also with basal-like, luminal-A and –B, and normal-like groups, depending on the classifier (Supplementary Table 5). Classification differences affected the distribution of bioclinical markers among molecular subgroups. Main differences were in the fraction of ER+/PR+ and AR+ tumors in basal-like subtypes and the distribution of ERBB2+ tumors (Supplementary Table 6). MFS curves showed better separation of good and bad outcome subgroups with our classification (the CIT classification) (Supplementary Figures 5 and 6).

Molecular subgroups show differential activation of signaling pathways

We selected 40 cancer-relevant pathways from public databases and tested for specific enrichment in our molecular subgroups (Supplementary Methods). Genes specific for each subgroup were identified using four algorithms. Pathways were ordered for each subgroup on the mean rank of P-values across the four methods. As shown in Figure 3, each subgroup was associated with different up- or downregulated signaling pathways. The upregulation of DNA replication and repair in BasL and LumB contrasted with its downregulation in NormL. The upregulation of four of the five immune system pathways in LumC was of further note. These data indicate that molecular subgroups relate to different signaling pathways and biological processes.
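
The ordering of pathways by mean rank of P-values across four methods can be sketched as below. This is an illustration only: the pathway names and P-values are invented, the four methods are left unnamed as in the text, and ties are broken by input order rather than by averaged ranks, which is an assumption of this sketch.

```python
def mean_rank_order(pvalues_by_method):
    """Order pathways by their mean P-value rank across methods.

    pvalues_by_method: list of dicts {pathway: P-value}, one per method.
    Returns (ordered pathways, mean rank per pathway); rank 1 is best.
    """
    pathways = list(pvalues_by_method[0])
    rank_sums = {p: 0 for p in pathways}
    for pvals in pvalues_by_method:
        ordered = sorted(pathways, key=lambda p: pvals[p])
        for rank, pathway in enumerate(ordered, start=1):
            rank_sums[pathway] += rank
    n_methods = len(pvalues_by_method)
    mean_ranks = {p: rank_sums[p] / n_methods for p in pathways}
    return sorted(pathways, key=lambda p: mean_ranks[p]), mean_ranks


# Invented P-values for three pathways under four enrichment methods.
methods = [
    {"DNA replication": 0.001, "Cell cycle": 0.01, "Immune response": 0.2},
    {"DNA replication": 0.02, "Cell cycle": 0.005, "Immune response": 0.5},
    {"DNA replication": 0.0001, "Cell cycle": 0.03, "Immune response": 0.04},
    {"DNA replication": 0.01, "Cell cycle": 0.02, "Immune response": 0.3},
]
order, ranks = mean_rank_order(methods)
```

Working on ranks rather than raw P-values keeps one method with unusually small P-values from dominating the ordering.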

Figure 3

Molecular subgroups show differential activation of major signaling pathways: correlations between a given pathway and a subgroup are indicated by color boxes. Red boxes show upregulation of the pathway, green downregulation. Up or downregulation was deduced using KEGGanim tool where relative expression measures are projected in the related KEGG pathway interaction graph. Pathways showing no clear direction of regulation were excluded.

Molecular subgroups show specific genomic anomalies

Of the 537 tumors profiled for RNA expression, 488 tumors were analyzed by array-CGH (comparative genome hybridization). A total of 21 regions of gain and 33 regions of loss were found in more than 30% of the tumors (Figure 4a, top panel). BasL and LumB showed extensive copy-number alterations (CNAs), whereas NormL and LumA were the least rearranged. Qualitative differences were also apparent (Figure 4a) and we searched for CNAs specifically associated with each subgroup. BasL and LumB tumors presented the greatest number of CNAs with 39 and 46 specific CNAs, respectively (Figure 4a, Supplementary Table 7). The number of specific events was lower in the other subgroups, ranging from 2 to 8. As expected, amplifications at 17q12 were found in 70% of mApo tumors. LumA showed gains at 4q35 and 16p11-p13, whereas NormL tumors could be differentiated from LumA by the frequency of gains at 9q33, 8p23 and 16p13 and of loss at 16q12.

Figure 4

Breast cancer molecular subgroups present different copy-number change (CNC) profiles. CNC profiles were established using genome-wide array-CGH on a 488 breast tumor dataset and subsequently stratified according to the CIT classification. Panel a shows frequency of gains (vertical bars going up) or losses (bars going down) at a given location on the genome. Graphs from top to bottom correspond to profiles of the whole CIT breast cancer set and each of the six molecular subgroups. Panel b represents regions of CNC correlating to a specific subgroup. Specific genomic regions for the whole CIT set are the ones for which the proportion of alterations (in gain or loss) exceeded 20%. Subgroup-specific regions are those that present significant increase in proportion (at a 0.1 FDR level) in a given subgroup tested against all others. Bars represent P-values after a standard logarithmic transformation.
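
The caption above selects subgroup-specific regions at a 0.1 FDR level without naming the procedure. As a hedged illustration, the sketch below applies the Benjamini-Hochberg step-up procedure, which is an assumption here and may differ from what the authors used; the function name and the five P-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.1):
    """Return one boolean per P-value: True if kept at the given FDR.

    Step-up rule: find the largest k such that the k-th smallest
    P-value is <= (k / m) * fdr, then keep the k smallest P-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


# Hypothetical per-region P-values for one subgroup against all others.
region_pvalues = [0.001, 0.04, 0.03, 0.5, 0.2]
specific = benjamini_hochberg(region_pvalues, fdr=0.1)
```

With these toy values the three smallest P-values pass their thresholds (0.02, 0.04 and 0.06), so the first three regions would be reported as subgroup specific.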

CNAs were associated with large-scale gene expression modifications. A total of 786 genes located in intervals of gain or loss showed significantly modified expression levels. In a number of regions of gain, genes encoding cell cycle and proliferation activators were overexpressed; conversely, known tumor suppressors and pro-apoptotic or DNA damage checkpoint genes were downregulated in regions of loss (Supplementary Table 7). These findings suggest that CNAs are part of a selective process associated with tumor progression, with differences from one subgroup to another. In that respect, 28 CNAs presented inverse patterns in different subgroups. These inverted patterns involved mainly BasL and LumB, but were also found in mApo and LumB or LumB and NormL (Supplementary Figure 7). Strikingly, they were associated with inverse expression of key cancer genes. These data support the notion that breast cancer subgroups arise along distinct genetic pathways. Focal DNA amplification (defined as high-level gains occurring in regions not larger than 3 Mb) occurred significantly more frequently in LumB, mApo and LumC than in the other subgroups (Supplementary Table 8a). We further investigated the occurrence of focal CNAs by analyzing a subset of 72 tumors from the CIT discovery set with high-resolution Illumina 610K-SNP-arrays (Supplementary Table 8b). We detected 246 gains and 337 losses (mean size 132 and 161 kb, respectively). We noted that 53% of the gains were also detected in our BAC-array data, while the overlap was lower for losses (19%). However, gains showed modest copy-number increase and were infrequently recurrent. Losses showed greater recurrence, but this corresponded mainly to probable CNVs (identical starts and ends). We verified the overlap of our subgroups with the recently proposed CNA-based classification (Jönsson ) and observed an overall coherence with our findings. 
Their CNA-based Basal-complex class overlapped with our BasL, 17q12 with part of our mApo and LumC, Luminal complex and amplifier with LumB and LumC, while the Luminal-simple corresponded globally to LumA and NormL (Figure 1d).

Fraction of non-tumor cells and distribution in molecular subgroups

The fraction of non-tumor cells is frequently discussed as a confounding factor in molecular analyses of breast cancer, fostering the proposition that the normal-like group is a possible artifact (Prat ). To obtain an objective estimate of the rate of non-diploid cells in our dataset and determine its distribution within molecular subgroups, we computed an estimate based on Illumina 610K-SNP data using a recent formula (Van Loo ). Significant differences were seen among molecular subgroups (Supplementary Figure 8a) with, surprisingly, mApo showing the lowest rate of non-diploid cells. NormL ranked third and LumA and LumB presented the highest fraction of non-diploid cells. Our results agreed with recent data (Van Loo ). However, a variable fraction of tumor cells may also be diploid, leading to an overestimation of normal cells. To assess this, a histological estimate of the non-tumor cell fraction was performed on the tumors analyzed with the Illumina 610K-SNP-arrays. This showed that SNP-based estimates of non-diploid cells were lower than pathological tumor cell content (Supplementary Figure 8b). Overall, these data are consistent with the idea that NormL represents a bona fide breast cancer subgroup.

Breast cancer subgroups and mammary epithelial cell hierarchy

To test whether our subgroups relate to distinct cells of origin in the mammary gland, we took advantage of three published expression profiling datasets of sorted normal mammary epithelial cell subpopulations (Raouf ; Lim ; Pece ). We inferred a signature that discriminated the mammary stem cell (MaSC) enriched, luminal progenitor (LPC), mature luminal (MLC) and stromal cell populations, and used this signature to classify our breast tumor expression data (Supplementary Methods). As shown in Figure 5, the principal component analysis ordered normal mammary epithelial cell fractions according to a differentiation gradient and breast tumors from BasL, mApo, LumC, LumB/NormL to LumA, suggesting a proximity of BasL and mApo with either MaSC or LPC, whereas ER+ subgroups showed a gradient between LPCs and MLCs. The correlation of BasL and mApo with the least differentiated cells (MaSC or LPC) in the normal mammary gland was confirmed in a second analysis (Supplementary Table 9).

Figure 5

Principal component analysis (PCA) of the CIT coreset expression profiles based on a meta-signature comparing normal mammary epithelial cell subpopulations. A 163 gene signature was produced by comparing different normal mammary cell contingents from three independent studies (GSE16997, GSE18931, GSE11395) and used in a PCA. Samples from the CIT coreset (panel a) and normal mammary gland samples (panel b) from GSE16997 were projected in the two first principal components in the upper and lower panel, respectively.

Prognostic significance of molecular subgroups

We next compared the prognostic significance in terms of metastatic relapse of our molecular subgroups to classical prognostic factors (ER, ERBB2/HER2, SBR grading and nodal involvement). As shown in Table 2, our classification signature performed better in both univariate and multivariate analyses than the classical prognostic factors, in both the discovery and validation sets. However, the absence of central pathology review in both datasets prevents us from drawing firm conclusions on the independent prognostic power of our signature. In a comparative analysis with five expression signatures (Sorlie ; van ‘t Veer ; Hu ; Sotiriou ; Parker ), our signature came second after the van't Veer signature in the discovery set and performed best in the validation set (Supplementary Table 10), demonstrating the important difference in terms of prognosis among molecular subgroups.

Table 2

Prognostic significance of the CIT classification

Variable	Univariate analysis						Multivariate analysis
	Value	HR	95% CI	P-value modality	P-value model	n	HR	95% CI	P-value modality	P-value model	n
(a) Clinical parameters
CIT (ref=NormL)	LumA	1.66	0.84–3.30	0.15	1.8 × 10⁻⁵	426	1.66	0.78–3.52	2.5 × 10⁻⁴	6.4 × 10⁻⁶	371
	Other	3.16	1.82–5.48	4.3 × 10⁻⁵		426	2.99	1.62–5.51
ER (ref=Pos)	Negative	1.85	1.22–2.81	0.003	0.003	426	1.19	0.72–1.97	0.5
ERBB2 (ref=Neg)	Positive	1.18	0.74–1.9	0.49	0.49	426	0.89	0.52–1.5	0.66
N (ref=0)	1+	1.43	0.85–2.38	0.18	0.17	373	1.55	0.92–2.63	0.1
T (ref=[0,1])	>1	2.08	1.3–3.31	0.0021	0.0016	422	2.21	1.3–3.76	0.003
SBR (ref=1)	2	2.92	0.91–9.36	0.07	3 × 10⁻⁴	418
	3	5.19	1.63–16.53	0.005		418
Chemotherapy adjuvant (ref=No)	Yes	1.09	0.73–1.62	0.67	0.67	378
Hormonal adjuvant (ref=No)	Yes	0.64	0.44–0.94	0.02	0.02	375

(b) Molecular signatures
CIT (ref=NormL)	LumA	1.66	0.84–3.30	1.8 × 10⁻⁵		426	2.0	0.74–5.33	3.7 × 10⁻¹		426
	Other	3.16	1.82–5.48			426	1.8	0.63–5.06
Sorlie (ref=NormL)	LumA	1.37	0.74–2.52	2.4 × 10⁻³		426	1.9	0.7–5.03	5.0 × 10⁻¹
	Other	2.29	1.31–4.00			426	1.4	0.57–3.29
Hu (ref=LumA)	NormL	1.67	0.86–3.25	9.6 × 10⁻⁵		426	2.7	0.98–7.35	4.2 × 10⁻¹
	Other	2.88	1.69–4.93			426	1.6	0.72–3.73
Parker (ref=LumA)	NormL	1.43	0.74–2.75	3.5 × 10⁻³		426	1.25	0.49–3.18	3.5 × 10⁻¹
	Other	2.26	1.34–3.81			426	0.8	0.36–1.79
GGI (ref=Low risk)	High risk	2.51	1.60–3.93	3.4 × 10⁻⁵		426	1.0	0.43–2.47	8.0 × 10⁻¹
Van′t Veer (ref=Low risk)	High risk	2.93	2.00–4.27	5.9 × 10⁻⁹		426	2.8	1.53–5.1	3.4 × 10⁻³

Abbreviations: CI, confidence-interval; CIT, our classification; HR, hazard ratio.

Relative risk was calculated taking metastatic relapse as an endpoint and compared with that of (a) clinical parameters and (b) three molecular classifiers (Sorlie, Hu, Parker) and two prognostic signatures (GGI, Van′t Veer). The dataset comprised 426 patients from the CIT discovery set for which MFS information was available. Complete clinical information was available in 371 cases, explaining the smaller numbers in the multivariate analysis on prognostic factors. Prognostic significance was assessed by applying a Cox model. Columns refer to the HR, the 95% CI and the P-values for both univariate and multivariate models.
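
For readers who want to relate the HR and 95% CI columns of Table 2, the sketch below recovers the Cox coefficient and its standard error from a reported interval. It assumes a Wald-type interval, exp(β ± 1.96 × SE), which is the usual convention but is not stated in the text; under that assumption the HR is the geometric mean of the two bounds. The function name is hypothetical.

```python
from math import exp, log


def coef_and_se_from_ci(ci_low, ci_high, z=1.96):
    """Recover log(HR) and its standard error from a 95% CI.

    Assumes a symmetric Wald interval on the log scale:
    CI = exp(beta - z * SE) to exp(beta + z * SE).
    """
    beta = (log(ci_low) + log(ci_high)) / 2
    se = (log(ci_high) - log(ci_low)) / (2 * z)
    return beta, se


# Univariate CIT rows of Table 2 (reference subgroup: NormL).
beta_luma, se_luma = coef_and_se_from_ci(0.84, 3.30)    # LumA
beta_other, se_other = coef_and_se_from_ci(1.82, 5.48)  # Other

hr_luma = exp(beta_luma)    # compare with the tabulated HR of 1.66
hr_other = exp(beta_other)  # compare with the tabulated HR of 3.16
```

Both reconstructed values agree with the tabulated hazard ratios to within rounding, which is consistent with the Wald assumption for these rows.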

Molecular subgroups show differential response to chemotherapy

To test whether our classification could predict chemotherapy response, we analyzed three datasets of locally advanced breast cancers treated by neoadjuvant therapy followed by surgery and assessment of the pathological response. ER− breast cancers were overrepresented in the three cohorts, but our signature allowed the assignment of tumors to four subgroups after pooling LumB and LumC, as well as LumA and NormL, to reach sufficient sample size by subgroup. Despite different chemotherapy protocols in individual cohorts, clear differences in response were observed. BasL and mApo showed the best response rates, with 50% and 37% of complete response, respectively. ER+ subgroups showed 15% of complete response in LumB/LumC tumors and 0% in LumA/NormL (Table 3a). Prediction of complete pathological response by the CIT classification was then compared with that of ER status and SBR grade in the three pooled datasets. In both the univariate and multivariate analyses, the CIT classification showed the strongest score (Table 3b).

Table 3

Differential response to chemotherapy according to molecular subgroups of the CIT classification

	Treatment	n	Response	BasL	mApo	LumC/LumB	LumA/NormL	P-value
(a) Correlation between the molecular subgroups of the CIT classification and pathological complete response to chemotherapy
Hess	T/FAC	125	pCR	17 (68%)	11 (32%)	3 (7%)	0 (0%)	2.6 × 10⁻⁹
			no pCR	8 (32%)	23 (68%)	41 (93%)	22 (100%)

CIT	EC	58	pCR	8 (53%)	6 (46%)	2 (7%)	0 (0%)	1.6 × 10⁻³
			no pCR	7 (47%)	7 (54%)	25 (93%)	3 (100%)

Bonnefoi	FEC	66	pCR	16 (43%)	7 (41%)	5 (42%)		NS
			no pCR	21 (57%)	10 (59%)	7 (58%)

	TET	58	pCR	17 (45%)	6 (35%)	3 (100%)		0.11
			no pCR	21 (55%)	11 (65%)	0 (0%)

Total		307	pCR	58 (50%)	30 (37%)	13 (15%)	0 (0%)	4.3 × 10⁻¹⁰
			no pCR	57 (50%)	51 (63%)	73 (85%)	25 (100%)

		Univariate Analysis				Multivariate Analysis
	Value	n	Odds ratio	95% CI	P-value	n	Odds ratio	95% CI	P-value
(b) Uni- and multivariate analyses of factors predictive for pathological complete response to chemotherapy in the three pooled datasets
ER	ER−	307	4.5	2.5–8.4	2.1 × 10⁻⁰⁸	291	1.6	0.67–4.2	0.28
Grade	Grade3	291	3.2	1.8–5.8	3.6 × 10⁻⁰⁵		1.9	1–3.5	0.04
CIT molecular classification	BasL/mApo	307	6.1	3.1–13	7.0 × 10⁻¹⁰		3.8	1.3–11	0.01

Abbreviations: CIT, our classification; NS, not significant; pCR, pathological complete response.

Table 3a shows the correlation between pCR and CIT molecular subgroups. pCR and absence of response (no pCR) to chemotherapy were analyzed in three clinical trials (Hess , Bonnefoi , CIT set). Owing to the small sample sizes, four main subgroups and two intermediate subgroups were combined into two groups: (LumB; LumC; LumB/C) and (NormL; LumA; NormL/LumA). Treatment description: EC, six cycles of a dose-dense regimen of 75 mg/m² epirubicin and 1200 mg/m² cyclophosphamide, given every 14 days; T/FAC, 24 weeks of sequential paclitaxel and fluorouracil-doxorubicin-cyclophosphamide; FEC, fluorouracil, epirubicin and cyclophosphamide for six cycles; TET, docetaxel for three cycles followed by epirubicin plus docetaxel for three cycles. Correlations were calculated using the Fisher exact test. Table 3b shows uni- and multivariate analyses of factors predictive of pCR in the three pooled datasets. Univariate analysis was done using the Fisher exact test and multivariate analysis by logistic regression.
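
The pooled figures in Table 3 can be re-derived from the per-cohort counts, as sketched below. The counts are copied from Table 3a; blank LumA/NormL cells in the Bonnefoi rows are treated as zero, which is an assumption of this sketch. The odds ratio is the plain sample odds ratio for BasL/mApo versus the ER+ groups, and the variable names are hypothetical.

```python
# Column order: BasL, mApo, LumC/LumB, LumA/NormL (counts from Table 3a).
cohorts = {
    "Hess T/FAC": {"pCR": [17, 11, 3, 0], "no pCR": [8, 23, 41, 22]},
    "CIT EC": {"pCR": [8, 6, 2, 0], "no pCR": [7, 7, 25, 3]},
    # Blank LumA/NormL cells are assumed to be empty groups (zero).
    "Bonnefoi FEC": {"pCR": [16, 7, 5, 0], "no pCR": [21, 10, 7, 0]},
    "Bonnefoi TET": {"pCR": [17, 6, 3, 0], "no pCR": [21, 11, 0, 0]},
}

# Pool the cohorts column by column.
pooled_pcr = [sum(col) for col in zip(*(c["pCR"] for c in cohorts.values()))]
pooled_no = [sum(col) for col in zip(*(c["no pCR"] for c in cohorts.values()))]

# Percentage of pathological complete response per pooled subgroup.
rates = [round(100 * p / (p + n)) for p, n in zip(pooled_pcr, pooled_no)]

# Sample odds ratio of pCR: BasL/mApo versus the two ER+ groups.
pcr_er_neg = pooled_pcr[0] + pooled_pcr[1]
no_er_neg = pooled_no[0] + pooled_no[1]
pcr_er_pos = pooled_pcr[2] + pooled_pcr[3]
no_er_pos = pooled_no[2] + pooled_no[3]
odds_ratio = (pcr_er_neg * no_er_pos) / (no_er_neg * pcr_er_pos)
```

The pooled counts reproduce the Total rows of Table 3a, the rates reproduce the 50%, 37%, 15% and 0% quoted in the text, and the odds ratio rounds to the univariate value of 6.1 given in Table 3b.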

Discussion

Breast cancer heterogeneity, reflected in molecular subgroups, can be attributed to differences in molecular alterations, cellular origin or both. We present a classification of breast cancer into six molecular subgroups, which differed in gene expression, genomic profiles, differentiation level and clinical features.

First, gene expression differences strongly suggested that the subgroups outlined distinct biological entities, reflecting initiating mutations and/or cell-of-origin. Specific sets of signaling pathways were associated with each subgroup. The distribution of the six subgroups was determined by the combined expression of three large gene clusters organized around (i) estrogen receptor, (ii) androgen receptor and (iii) cell cycle regulator genes. The ER cluster is well known as defining luminal breast tumors (Bertucci), and the expression of AR in breast cancer has long been known (Isola, 1993) but has been confounded with that of the ER cluster (Doane). Combined expression of the AR and ER clusters yields three broad classes determined by nuclear receptor expression: AR−/ER−/PR− (triple negative), corresponding to the BasL subgroup; AR+/ER−/PR− (mApo); and AR+/ER+/PR+ (triple positive), including the four ER+ subgroups. The AR cluster comprises key genes previously associated with the ER cluster, such as the pioneer factor FOXA1, which recruits ER, AR and RAR/RXR (Carroll; Lupien). The existence of an ER−/AR+ breast tumor subset (our mApo subgroup) has been proposed (Farmer; Doane), and its important overlap with ERBB2/HER2 amplification is intriguing, possibly reflecting cross-talk between the AR and ERBB2/HER2 pathways (Naderi & Hughes-Davies, 2008). However, it is notable that our classification did not define an ERBB2 subgroup. Instead, ERBB2-amplified cancers were distributed between the mApo (ER−) and LumC (ER+) subgroups. We found fewer expression differences between mApo/ERBB2+ and mApo/ERBB2− tumors than between mApo and LumC tumors (Supplementary Figure 9). Interestingly, Staaf showed that ER− and ER+ ERBB2-amplified tumors presented different 17q CNA patterns. These observations could have clinical implications, as they indicate that ERBB2+ breast cancers form a biologically heterogeneous group. Moreover, it seems important to distinguish ERBB2+ and mApo tumors, because the so-called triple-negative group comprises both BasL and ERBB2−/mApo tumors despite clear molecular and clinical differences.

Second, subgroups were also characterized by different patterns of genomic anomalies. These data were concordant with previous results (Chin; Natrajan) and with the CGH classification recently proposed by Jönsson et al. (2010). Moreover, the existence of chromosomal regions showing inverse patterns (gain in one subgroup, loss in another) further supported the notion that these subgroups progress along distinct genetic routes, which possibly involve different mechanisms of genetic instability.

Third, our data indicated that subgroups differed in their differentiation level, pointing to possible differences in cell-of-origin. This was suggested by similarities between the transcriptomes of distinct cellular contingents in the normal mammary gland and those of the molecular subgroups. While BasL and mApo showed proximity to MaSC or luminal progenitors, ER+ subgroups formed a gradient between LPCs (LumC) and MLC cells (LumA). Our findings are consistent with recent work suggesting that LPCs were the cells of origin of basal cancer and Brca1 mammary tumors (Lim; Molyneux). These findings shed light on the prevalence of Grade 3 tumors in BasL and mApo, which contrasts sharply with that of low-grade cancers in NormL and LumA. Our data thus suggest that breast cancer may arise from at least two distinct cell types and that the final phenotype results from genetic and epigenetic changes occurring during cancer progression. This may also be linked to the striking gradient of TP53 mutations observed between the BasL and NormL subgroups. The correlation with elevated expression of the cell-cycle cluster and increased genomic instability was also notable. Moreover, there is a striking parallel between the incidence of TP53 inactivation and the response rates to neoadjuvant chemotherapies. These data are in line with our previous observation suggesting that TP53 is not the mediator of chemotherapy-induced cell death (Bertheau).

Fourth, molecular subgroups showed striking differences with respect to metastatic relapse, in terms of both kinetics and site of recurrence. While BasL and mApo tumors preferentially metastasized to the brain and rarely to the bone, ER+ subgroups exhibited the inverse pattern, strengthening previous studies (Smid). Our data suggest that these differences could be due to differential expression of key metastasis genes (Bos). Hence, metastasis to a specific organ can also be the result of a subgroup-specific gene program and coexist with the de novo acquisition of stochastic mutations, as recently shown by massively parallel sequencing work (Ding; Yachida). Outcomes of the different subgroups were very different as well. BasL and mApo showed earlier relapse, but a remarkably stable MFS for the next 100 months. In contrast, although all ER+ subgroups did better during the first years, a continuous incidence of late relapse was observed, and the outcome of LumB and LumC progressively became worse than that of BasL or mApo. However, a number of the recurrences occurring after 5 years in ER+ subgroups are probably linked to interruption of anti-estrogen treatments.

The status of the NormL subgroup is of particular interest because its existence has been put in doubt and attributed to an elevated content of normal cells (Prat). In line with recently published data (Van Loo), we showed that NormL tumors did not present a lower fraction of non-diploid cells than mApo or LumC. Furthermore, 70% of NormL tumors showed loss at 16q, further supporting the view that this subgroup does not result from a co-clustering of breast tumors presenting smaller fractions of tumor cells.

Our results favor the existence of different breast cancer subtypes with distinct biologies and clinical courses. We propose that stratifying breast cancers according to such a classification could be highly beneficial when searching for new prognostic or response-to-treatment indicators. These would be subgroup specific instead of expressing the differences between highly and poorly proliferating tumors. Furthermore, such a classification, once adapted to a format compatible with the clinical setting, could efficiently contribute to disease management. Indeed, the different subgroups outlined here occur in different age groups, metastasize to different organs and exhibit distinct survival kinetics. Similarly, the association with immune system activation pathways in LumC may be indicative of anti-tumor immunity in this specific subgroup. All of these are clear indications that the subgroups represent distinct clinical and biological entities.

Materials and methods

Patients and tumors

A total of 724 primary breast carcinomas were collected and analyzed for expression profiling on Affymetrix U133-Plus2.0 chips, and a subset of 488 samples was analyzed by array-CGH. In addition, 58 fine-needle aspiration biopsies from patients undergoing neoadjuvant chemotherapy were analyzed by transcriptome profiling and included in the response-to-chemotherapies set. A full description can be found in Supplementary Table 1 and Table 2. Mean follow-up time was 65 months. Four RNA samples from normal human breast tissue were used as reference. Determination of histological grade and of ER, PR and HER2 levels is detailed in the Supplementary Methods.

Discovery and validation sets

Our 724 breast tumor transcriptome dataset was split into a CIT-discovery-set comprising 537 (75%) tumors, of which 488 were analyzed by array-CGH, and a validation-set of 187 (25%) cases. The Affymetrix validation set comprised these 187 CIT samples and 2225 transcriptomes collected from GEO and ArrayExpress (Supplementary Table 2).

Expression profiling and data analysis

RNA profiling

Methods used for RNA purification, quality control, fluorescent probe production, hybridization and data processing were essentially as previously described (de Reynies).

Transcriptome analysis and molecular subgroup determination

Our rationale was to ensure the greatest possible homogeneity of the identified subgroups. Subgroup determination was based on the CIT-discovery-set of 537 transcriptomes and on a clustering approach iterating unsupervised and supervised steps (Supplementary Figure 1, Supplementary Methods). Microarray data were first classified with a set of the 244 most variant probe sets, using hierarchical clustering, k-means and a Gaussian mixture model in parallel. Tumors that were assigned to the same group by all three methods were kept, defining a coreset of 355 tumors. Based on this coreset, the most discriminative genes were selected by analysis of variance and ranked by random forest, producing a 256-gene signature and leading to the identification of six homogeneous molecular subgroups. Validation datasets were independently classified into the CIT molecular subgroups by applying a classical distance-to-centroid approach, implemented in the citbcmst R package, which is available at http://cran.r-project.org/web/packages/citbcmst/index.html and comes with (Sweave) user documentation. The complete classification procedure is detailed in the Supplementary Methods.
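The two key ideas of this procedure, keeping only tumors on which all three clustering methods agree and then assigning new samples to the nearest subgroup centroid, can be sketched as follows. This is a minimal illustration on toy data, not the citbcmst implementation: it assumes that cluster labels have already been matched across methods, uses Euclidean distance as a stand-in metric, and omits the analysis of variance and random-forest gene selection. All names and values are hypothetical.

```python
from math import sqrt

def consensus_core(*labelings):
    """Indices of samples given the same label by every method.

    Assumes cluster labels were already matched across methods.
    """
    return [i for i, labels in enumerate(zip(*labelings)) if len(set(labels)) == 1]

def centroids(profiles, labels):
    """Mean expression profile per subgroup."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {g: [sum(col) / len(rows) for col in zip(*rows)] for g, rows in groups.items()}

def nearest_centroid(profile, cents):
    """Assign a profile to the subgroup with the closest centroid (Euclidean)."""
    def dist(u, v):
        return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return min(cents, key=lambda g: dist(profile, cents[g]))

# Toy data: 5 samples x 2 genes, labelled by three hypothetical methods.
profiles = [[5.0, 1.0], [4.8, 1.2], [1.0, 5.0], [1.1, 4.7], [3.0, 3.0]]
hclust = ["A", "A", "B", "B", "A"]
kmeans = ["A", "A", "B", "B", "B"]
gmm = ["A", "A", "B", "B", "A"]

core = consensus_core(hclust, kmeans, gmm)   # the last sample is ambiguous
cents = centroids([profiles[i] for i in core], [hclust[i] for i in core])
print(core, nearest_centroid([4.5, 1.5], cents))
```

Restricting centroid estimation to the consensus core trades sample size for homogeneity: ambiguous tumors do not blur the centroids, but they can still be classified afterwards by the distance-to-centroid rule.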

Comparison with the Sorlie, Hu and Parker classifiers

Sorlie (Sorlie), Hu (Hu) and Parker (Parker) centroids were retrieved from http://genome-www.stanford.edu/breast_cancer/robustness/data/IntrinsicGeneList.txt, https://genome.unc.edu/pubsup/breastTumor/data/306genes-X-249samples-X-5subtypes+5centroids.xls and https://genome.unc.edu/pubsup/breastGEO/pam50_centroids.txt, respectively. To build the classifiers, the corresponding clone UniGene_IDs were mapped to Affymetrix (U133A or U133Plus2) probe sets. This was possible for 334 UniGene_IDs/gene symbols for Sorlie and for 232 UniGene_IDs for Hu; for Parker, all genes could be mapped directly.

Comparison with the Jönsson array-CGH-based classification

The six Jönsson centroids are relative to genomic regions determined with the GISTIC algorithm (Jönsson). Details are provided in the Supplementary Methods.

Cancer pathways analysis

Cancer relevant pathways were retrieved from KEGG (ftp://ftp.genome.ad.jp/pub/kegg/pathways/hsa), Biocarta (http://www.biocarta.com) and GO (http://www.geneontology.org/), and related genes were mapped to non-redundant HUGO Gene symbols. Four gene set analysis methods were used (Supplementary Methods), yielding P-values, which were transformed into ranks. Gene sets were ranked by order of interest according to the mean of the ranks across the four methods.
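The rank-aggregation step can be sketched as follows: within each method, P-values are converted to ranks (smallest P-value first), and gene sets are then ordered by their mean rank across methods. The method names, gene-set names and P-values below are hypothetical, and giving tied P-values their average rank is an assumption, since the handling of ties is not described here.

```python
def ranks(values):
    """Rank values ascending (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def mean_rank_order(pvalues_by_method):
    """Order gene sets by mean rank across methods.

    pvalues_by_method: {method: {gene_set: P-value}}, same gene sets in each.
    """
    gene_sets = sorted(next(iter(pvalues_by_method.values())))
    totals = dict.fromkeys(gene_sets, 0.0)
    for pvals in pvalues_by_method.values():
        for gs, r in zip(gene_sets, ranks([pvals[gs] for gs in gene_sets])):
            totals[gs] += r
    n = len(pvalues_by_method)
    return sorted((totals[gs] / n, gs) for gs in gene_sets)

# Hypothetical P-values from four gene set analysis methods.
pvalues = {
    "method1": {"cell_cycle": 0.001, "er_signaling": 0.02, "immune": 0.3},
    "method2": {"cell_cycle": 0.01, "er_signaling": 0.005, "immune": 0.2},
    "method3": {"cell_cycle": 0.0001, "er_signaling": 0.04, "immune": 0.04},
    "method4": {"cell_cycle": 0.002, "er_signaling": 0.03, "immune": 0.5},
}
result = mean_rank_order(pvalues)
print(result)
```

Working on ranks rather than raw P-values makes the four methods comparable even when their P-value scales differ.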

Array-CGH

Array-CGH was performed on a 4434 BAC-array with a median resolution of 0.6 Mb. DNA labeling, hybridization and data processing are as described in the Supplementary Methods.

Statistical tests

Clinical correlations were determined by the χ² test for qualitative factors and by analysis of variance for quantitative variables. Disease outcome was investigated with Kaplan–Meier curves using metastatic recurrence as the endpoint and subgroup for stratification. MFS was calculated from the date of diagnosis until first metastatic relapse. P-values at 60 and 180 months resulted from a log-rank test on Cox estimates. The Benjamini and Hochberg method was applied for multiple-testing adjustment.
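Two of the procedures named above are simple enough to sketch directly: the Benjamini and Hochberg adjustment and the Kaplan–Meier estimate of metastasis-free survival. The code below is a minimal illustration on hypothetical values, not the analysis pipeline used for this study, and it does not cover the log-rank test or Cox modelling.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top                  # 1-based ascending rank
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival)] at each event time.

    times: follow-up in months; events: True for relapse, False if censored.
    """
    at_risk = len(times)
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone observed at time t (relapsed or censored) leaves the risk set.
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

# Hypothetical raw P-values and follow-up data (months, relapse indicator).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
curve = kaplan_meier([5, 8, 12, 12, 20], [True, False, True, True, False])
print(adjusted, curve)
```

In the Kaplan–Meier function, censored patients reduce the number at risk without lowering the survival estimate, which is how patients with shorter follow-up still contribute information up to their last visit.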

35 in total

Review 1. Gene expression profiling and clinical outcome in breast cancer.

Authors: François Bertucci; Pascal Finetti; Nathalie Cervera; Dominique Maraninchi; Patrice Viens; Daniel Birnbaum
Journal: OMICS Date: 2006

2. Genome-wide analysis of estrogen receptor binding sites.

Authors: Jason S Carroll; Clifford A Meyer; Jun Song; Wei Li; Timothy R Geistlinger; Jérôme Eeckhoute; Alexander S Brodsky; Erika Krasnickas Keeton; Kirsten C Fertuck; Giles F Hall; Qianben Wang; Stefan Bekiranov; Victor Sementchenko; Edward A Fox; Pamela A Silver; Thomas R Gingeras; X Shirley Liu; Myles Brown
Journal: Nat Genet Date: 2006-10-01 Impact factor: 38.330

3. FoxA1 translates epigenetic signatures into enhancer-driven lineage-specific transcription.

Authors: Mathieu Lupien; Jérôme Eeckhoute; Clifford A Meyer; Qianben Wang; Yong Zhang; Wei Li; Jason S Carroll; X Shirley Liu; Myles Brown
Journal: Cell Date: 2008-03-21 Impact factor: 41.582

4. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer.

Authors: Kenneth R Hess; Keith Anderson; W Fraser Symmans; Vicente Valero; Nuhad Ibrahim; Jaime A Mejia; Daniel Booser; Richard L Theriault; Aman U Buzdar; Peter J Dempsey; Roman Rouzier; Nour Sneige; Jeffrey S Ross; Tatiana Vidaurre; Henry L Gómez; Gabriel N Hortobagyi; Lajos Pusztai
Journal: J Clin Oncol Date: 2006-08-08 Impact factor: 44.544

5. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies.

Authors: Koei Chin; Sandy DeVries; Jane Fridlyand; Paul T Spellman; Ritu Roydasgupta; Wen-Lin Kuo; Anna Lapuk; Richard M Neve; Zuwei Qian; Tom Ryder; Fanqing Chen; Heidi Feiler; Taku Tokuyasu; Chris Kingsley; Shanaz Dairkee; Zhenhang Meng; Karen Chew; Daniel Pinkel; Ajay Jain; Britt Marie Ljung; Laura Esserman; Donna G Albertson; Frederic M Waldman; Joe W Gray
Journal: Cancer Cell Date: 2006-12 Impact factor: 31.743

6. Subtypes of breast cancer show preferential site of relapse.

Authors: Marcel Smid; Yixin Wang; Yi Zhang; Anieta M Sieuwerts; Jack Yu; Jan G M Klijn; John A Foekens; John W M Martens
Journal: Cancer Res Date: 2008-05-01 Impact factor: 12.701

7. Validation of gene signatures that predict the response of breast cancer to neoadjuvant chemotherapy: a substudy of the EORTC 10994/BIG 00-01 clinical trial.

Authors: Hervé Bonnefoi; Anil Potti; Mauro Delorenzi; Louis Mauriac; Mario Campone; Michèle Tubiana-Hulin; Thierry Petit; Philippe Rouanet; Jacek Jassem; Emmanuel Blot; Véronique Becette; Pierre Farmer; Sylvie André; Chaitanya R Acharya; Sayan Mukherjee; David Cameron; Jonas Bergh; Joseph R Nevins; Richard D Iggo
Journal: Lancet Oncol Date: 2007-11-19 Impact factor: 41.316

8. An estrogen receptor-negative breast cancer subset characterized by a hormonally regulated transcriptional program and response to androgen.

Authors: A S Doane; M Danso; P Lal; M Donaton; L Zhang; C Hudis; W L Gerald
Journal: Oncogene Date: 2006-02-20 Impact factor: 9.867

9. A functionally significant cross-talk between androgen receptor and ErbB2 pathways in estrogen receptor negative breast cancer.

Authors: Ali Naderi; Luke Hughes-Davies
Journal: Neoplasia Date: 2008-06 Impact factor: 5.715

10. Exquisite sensitivity of TP53 mutant and basal breast cancers to a dose-dense epirubicin-cyclophosphamide regimen.

Authors: Philippe Bertheau; Elisabeth Turpin; David S Rickman; Marc Espié; Aurélien de Reyniès; Jean-Paul Feugeas; Louis-François Plassa; Hany Soliman; Mariana Varna; Anne de Roquancourt; Jacqueline Lehmann-Che; Yves Beuzard; Michel Marty; Jean-Louis Misset; Anne Janin; Hugues de Thé
Journal: PLoS Med Date: 2007-03 Impact factor: 11.069
