Literature DB >> 26770261

Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset.

Heloisa H Milioli¹, Renato Vimieiro², Inna Tishchenko³, Carlos Riveros³, Regina Berretta³, Pablo Moscato³.

Abstract

BACKGROUND: Multi-gene lists and single sample predictor models have been currently used to reduce the multidimensional complexity of breast cancers, and to identify intrinsic subtypes. The perceived inability of some models to deal with the challenges of processing high-dimensional data, however, limits the accurate characterisation of these subtypes. Towards the development of robust strategies, we designed an iterative approach to consistently discriminate intrinsic subtypes and improve class prediction in the METABRIC dataset.
FINDINGS: In this study, we employed the CM1 score to identify the most discriminative probes for each group, and an ensemble learning technique to assess the ability of these probes on assigning subtype labels using 24 different classifiers. Our analysis is comprised of an iterative computation of these methods and statistical measures performed on a set of over 2000 samples. The refined labels assigned using this iterative approach revealed to be more consistent and in better agreement with clinicopathological markers and patients' overall survival than those originally provided by the PAM50 method.
CONCLUSIONS: The assignment of intrinsic subtypes has a significant impact in translational research for both understanding and managing breast cancer. The refined labelling, therefore, provides more accurate and reliable information by improving the source of fundamental science prior to clinical applications in medicine.

Entities: Chemical Disease Gene Species

Keywords: Breast cancer; CM1 score; Classifiers; Data mining; Ensemble learning; Feature selection; Intrinsic subtypes; METABRIC; Predictor models; Subtype prediction

Year: 2016 PMID： 26770261 PMCID： PMC4712506 DOI： 10.1186/s13040-015-0078-9

Source DB: PubMed Journal: BioData Min ISSN： 1756-0381 Impact factor: 2.522

Findings

Translational research aims at bringing basic scientific discoveries into outcomes that help improve clinical decision-making. The PAM50 Breast Cancer Intrinsic Classifier [1] has lately been used to assign the molecular subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like [2-5]) based on shrunken centroids of gene expression profiles [6]. It uses a Single Sample Predictor (SSP) model with an embedded 50-gene assay. In spite of the relevance of this method for clinical management, there are limited investigations in the literature that support this classification approach. Comparison with other methods showed only moderate agreement between subtype labels assigned, as well as independent clinical prognostic information [7-9]. Other multi-gene signatures have also been reported within the molecular patterns strongly correlated to clinical prognosis [10, 11], disease progression [12, 13], and patient survival [14]. Different methods, however, highlight a variety of gene lists of distinct size due the analysis of diverse microarray data and platform technologies. Additionally, the methods currently applied bring a pragmatic concern of using SSP models for predicting disease subtypes. Multiple classifiers or ensemble learning model, on the other hand, have compensated for poor learning algorithms by performing extra computation [15]. Therefore, there is an urgent need for translating these novel strategy to provide more accurate predictions of clinicopathological outcome. In 2012, The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [16] disclosed a rich gene expression cohort widely used for investigating breast cancer diseases. In spite of the quality of this dataset, there are some inconsistencies with regards to the subtype labels assigned in the original cohort [17]. In our previous study [17], a thorough review of the intrinsic subtypes was suggested and is, therefore, mandatory given the importance of this dataset to breast cancer research. For this report, we then propose a more robust approach to iteratively refine the labels in the METABRIC dataset based on ensemble learning. The new labels are yet correlated to well-established clinicopathological markers and patient overall survival.

Methods

Transcriptomic datasets

The breast cancer dataset disclosed by the METABRIC study (EGAS00000000083) contains cDNA microarray profiling of about 2000 samples performed on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3) [16]. The samples were originally partitioned into two subsets: Discovery (997 samples) and Validation (989 samples), respectively used as training and test sets in our analysis. In this cohort, tumour samples were assigned on the five intrinsic subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like) according to the PAM50 method [1]. The METABRIC study was approved by the Institutional Review Board [16] and our research was authorized by the Human Ethics Research Committee at The University of Newcastle, Australia (H-2013-0277).

Refinement method

The overview of the refinement method applied on the METABRIC dataset is shown in Fig. 1. The process is initialized with the discovery set and the original PAM50 labels as defined in Curtis et al. [16]. After computing the CM1 score, the top ten highly discriminative probes (five with the greatest positive CM1 score values - indicating up-regulated probes relative to the other subtypes, and five with the smallest negative values – representing down-regulation) are chosen for each class. The set of new features is used to train the 24 classifiers from the Weka software suite [18], where a ten-fold cross-validation is performed. If the majority of the classifiers agree on the same label, the sample is assigned with the corresponding subtype; otherwise it is marked as inconsistent and not further considered in the process. The stopping criterion is reached when there are no more changes in the sample labels and feature set, or when the desired Fleiss’ kappa value (κ≥0.92) is achieved between the previous and the current iteration steps (see Section Statistics below for definitions). Values between 0.81 and one are considered to be almost perfect agreement, thus 0.92 is above the average for this interval.

Fig. 1

Refinement process. The process is initialized with labels assigned using the PAM50 method. After computing the CM1 score, the top 10 highly discriminative probes are selected for each subtype. This set of features is used to train the 24 distinct classifiers for a 10-fold cross-validation classification. Samples are relabelled (eventually with the same label) if the classifiers agree in at least 50 % of the cases; otherwise they are marked as inconsistent and not further considered in the iteration process. The stopping criterion is reached when there are no more changes in the sample labels or selected feature set, or when the desired Fleiss’ kappa is achieved. After stopping, the final feature set and sample labels are used to classify the samples previously marked as inconsistent or from the validation dataset. These samples are run through the same refinement procedure; inconsistent samples are reclassified and labels are refined When the stopping condition is fulfilled, the new list of features and sample labels are used for the training-test setting. Samples from the validation dataset or previously marked as inconsistent are then classified by training the classifiers in the refined discovery set. However, in the training-test setting, at least two thirds of classifiers in the ensemble must agree on the same label for it to be assigned to a sample. As a larger dataset is expected to provide more robustness, all the reclassified samples are run through the same refinement procedure again. The final outcome of this process is the set of refined features and the new labels. Since many classifiers tend to perform best when trained on classes of equal sample size, we adjusted the number of patients in each subtype by looking at the minimum number of samples in one of the subgroups. The normal-like subtype is represented by only 58 samples; thus, the total number of samples used in the training is 290. For each other subtype, 58 samples are randomly chosen from the dataset. The whole process is run ten times due to the interchangeable sample selection that weigh the different gene expression information used for training purposes.

The CM1 score

The CM1 score is a supervised method used to rank the variation of gene expression levels across samples from two different classes (more details in [17, 19]). The measure helps to identify the most discriminative features for each of the five breast cancer intrinsic subtypes: luminal A, luminal B, HER2-enriched, basal-like and normal-like. For a given subtype, we compute the CM1 score for each of the 48,803 probes and select the ten most discriminative ones. This happens iteratively in the refinement process each time the classifiers attribute a new label to a sample.

Statistics

Several measures have been computed in order to assess the quality of our results. We created a contingency table r×c comparing the predicted labels (rows) and labels from the previous refinement step (columns). Cramer’s V [20] is used to measure the level of association between sample original and predicted labels. The statistic ranges from zero (no association between the two variables) to one (complete association). Fleiss’ kappa [21, 22] is a popular interrater reliability metric used to gauge the agreement between the original PAM50 labels and the labels assigned by the majority of classifiers. Kappa values range from ≤0 to 1, where: (1) values ≤0 show a poor agreement; (2) 0≤κ≤0.2, slight agreement; (3) 0.21≤κ≤0.40, fair agreement; (4) 0.41≤κ≤0.60, moderate agreement; (5) 0.61≤κ≤0.80substantial agreement; and (6) 0.81≤κ≤1, almost perfect agreement. Adjusted Rand Index (ARI) [23, 24] measures the agreement between pairs of samples that are labelled either in the same class or in different classes. Results range from zero (complete discordance between two partitions) to one (perfect concordance between them).

Clinical data and survival analysis

The clinical markers oestrogen and progesterone receptors (ER and PR) and the human epidermal growth factor receptor two (HER2) are compared between original METABRIC labels and refined labels. Survival analysis was also performed, using Cox proportional hazards model from the package survival in the R software [25]. The p-value, used to test the null hypothesis that the curves stratified by subtype are identical in the overall population, is calculated using the log-rank test.

Results and Discussion

Discriminative probes used to assign the intrinsic subtype labels in the refinement process

Samples were assigned into the five intrinsic subtypes based on the majority voting of classifiers (Additional file 1: Table S1), supported by their consistent performance across the ten runs (Additional file 2). During this procedure, 74 discriminative probes appeared (Additional file 1: Table S2) and, among them, 35 were recurrently selected (Fig. 2). Overall, the association between the initial labels and those predicted using the ensemble learning (Table 1) was 0.95 according to Cramer’s V. The consensus of sample labelling across different classifiers measured using Fleiss’ kappa was 0.924. The ARI (1.00) also showed a maximum agreement between pairs of samples that are labelled either in the same class or in different classes.

Fig. 2

The heat map of refined intrinsic features selected using CM1 score in the refinement process. The heat map diagram exhibit 35 probes (rows) and 1992 samples (columns) from the discovery and validation sets ordered according to gene expression similarity. For visualisation, the expression values are normalised across the probes using a two-sided threshold of 1 % (for under- and over-expression). The bars on the bottom show the sample distribution according to the refined and original labels assigned to the METABRIC cohort. The subtypes are defined as follow: luminal A (blue), luminal B (green), HER2-enriched (yellow), normal-like (purple), basal-like (red), and inconsistent (grey)

Table 1

Contingency table for predicted labels vs. initial subtypes (rows and columns, respectively)

Subtypes	Lum A	Lum B	HER2	Basal	Normal	Summary
Lum A	563	94	11	2	58	728
Lum B	102	383	77	19	19	600
HER2	7	1	149	59	18	234
Basal	0	0	0	230	3	233
Normal	33	0	1	15	95	144
Inconsistent	16	14	2	6	9	47
Summary	721	492	240	331	202	1986

New subtype labels reveal more reliable distribution of clinical markers and survival curves

We correlated the METABRIC and predicted labels with the current clinical markers ER, PR and HER2. Table 2 shows the changes in number of samples across subtypes, labelled with the PAM50 method and refined labels, respectively. The refinement process improved the overall distribution to what is expected for each class: luminal A (ER+, PR+, HER2–), luminal B (ER+, PR ±, HER2 ±), HER2-enriched (ER–, PR–, HER2+) and basal-like (ER–, PR–, HER2–); especially for HER2-enriched and basal-like subtypes. Samples labelled as inconsistent in our study may also reflect the heterogeneity of the disease and a hint to as-yet improperly characterized molecular subtypes.

Table 2

Number of samples for each clinical marker in the PAM50 subtypes and refined labels

PAM50 subtypes
Class ∖Marker	PR+	PR-	ER+	ER-	HER2+	HER2-
Luminal A	550	171	717	4	23	698
Luminal B	309	183	492	0	45	447
Her2-enriched	51	189	98	142	135	105
Basal-like	29	302	41	290	30	301
Normal-like	106	96	164	38	16	186
Refined labels
Class ∖Marker	PR+	PR-	ER+	ER-	HER2+	HER2-
Luminal A	558	170	726	2	14	714
Luminal B	358	242	599	1	83	517
Her2-enriched	11	223	19	215	139	95
Basal-like	7	226	9	224	4	229
Normal-like	85	59	115	29	4	140
Inconsistent	26	21	44	3	5	42

Number of samples for each clinical marker in the PAM50 subtypes and refined labels Furthermore, the patient’s overall survival significantly improved across subtypes when the original and refined labels are used to plot the curves for the METABRIC discovery and validation sets (Fig. 3). The groups have a well defined separation after the refinement process (p value 2.8×10−26) compared to the original labels (p value 5.4×10−18). These results also support a better characterization of the intrgroups after the iterative approach.

Fig. 3

The survival curves for original and refined labels in the METABRIC discovery and validation sets. Each curve represents the survival probability at a certain time after the diagnosis. Drops in the curve indicate death. The probability of the last ten observations are plotted in dash

Conclusion

The iterative approach using CM1 score and ensemble learning has shown a great potential for predicting more accurate sample subtypes in the METABRIC breast cancer dataset. The refined labels are of great value to breast cancer research and future clinical translational science. Given the relevance of accurate subtype assignments, we encourage researchers to consider the proposed refined labels when analysing the METABRIC dataset.

18 in total

1. A gene-expression signature to predict survival in breast cancer across independent data sets.

Authors: A Naderi; A E Teschendorff; N L Barbosa-Morais; S E Pinder; A R Green; D G Powe; J F R Robertson; S Aparicio; I O Ellis; J D Brenton; C Caldas
Journal: Oncogene Date: 2006-08-28 Impact factor: 9.867

2. Diagnosis of multiple cancer types by shrunken centroids of gene expression.

Authors: Robert Tibshirani; Trevor Hastie; Balasubramanian Narasimhan; Gilbert Chu
Journal: Proc Natl Acad Sci U S A Date: 2002-05-14 Impact factor: 11.205

3. Molecular portraits of human breast tumours.

Authors: C M Perou; T Sørlie; M B Eisen; M van de Rijn; S S Jeffrey; C A Rees; J R Pollack; D T Ross; H Johnsen; L A Akslen; O Fluge; A Pergamenschikov; C Williams; S X Zhu; P E Lønning; A L Børresen-Dale; P O Brown; D Botstein
Journal: Nature Date: 2000-08-17 Impact factor: 49.962

4. Supervised risk predictor of breast cancer based on intrinsic subtypes.

Authors: Joel S Parker; Michael Mullins; Maggie C U Cheang; Samuel Leung; David Voduc; Tammi Vickery; Sherri Davies; Christiane Fauron; Xiaping He; Zhiyuan Hu; John F Quackenbush; Inge J Stijleman; Juan Palazzo; J S Marron; Andrew B Nobel; Elaine Mardis; Torsten O Nielsen; Matthew J Ellis; Charles M Perou; Philip S Bernard
Journal: J Clin Oncol Date: 2009-02-09 Impact factor: 44.544

5. Language Individuation and Marker Words: Shakespeare and His Maxwell's Demon.

Authors: John Marsden; David Budden; Hugh Craig; Pablo Moscato
Journal: PLoS One Date: 2013-06-27 Impact factor: 3.240

6. Characterization of uncertainty in the classification of multivariate assays: application to PAM50 centroid-based genomic predictors for breast cancer treatment plans.

Authors: Mark Tw Ebbert; Roy Rl Bastien; Kenneth M Boucher; Miguel Martín; Eva Carrasco; Rosalía Caballero; Inge J Stijleman; Philip S Bernard; Julio C Facelli
Journal: J Clin Bioinforma Date: 2011-12-23

7. The molecular portraits of breast tumors are conserved across microarray platforms.

Authors: Zhiyuan Hu; Cheng Fan; Daniel S Oh; J S Marron; Xiaping He; Bahjat F Qaqish; Chad Livasy; Lisa A Carey; Evangeline Reynolds; Lynn Dressler; Andrew Nobel; Joel Parker; Matthew G Ewend; Lynda R Sawyer; Junyuan Wu; Yudong Liu; Rita Nanda; Maria Tretiakova; Alejandra Ruiz Orrico; Donna Dreher; Juan P Palazzo; Laurent Perreard; Edward Nelson; Mary Mone; Heidi Hansen; Michael Mullins; John F Quackenbush; Matthew J Ellis; Olufunmilayo I Olopade; Philip S Bernard; Charles M Perou
Journal: BMC Genomics Date: 2006-04-27 Impact factor: 3.969

8. Repeated observation of breast tumor subtypes in independent gene expression data sets.

Authors: Therese Sorlie; Robert Tibshirani; Joel Parker; Trevor Hastie; J S Marron; Andrew Nobel; Shibing Deng; Hilde Johnsen; Robert Pesich; Stephanie Geisler; Janos Demeter; Charles M Perou; Per E Lønning; Patrick O Brown; Anne-Lise Børresen-Dale; David Botstein
Journal: Proc Natl Acad Sci U S A Date: 2003-06-26 Impact factor: 12.779

9. Identification of a 5-protein biomarker molecular signature for predicting Alzheimer's disease.

Authors: Martín Gómez Ravetti; Pablo Moscato
Journal: PLoS One Date: 2008-09-03 Impact factor: 3.240

10. A pathway-based data integration framework for prediction of disease progression.

Authors: José A Seoane; Ian N M Day; Tom R Gaunt; Colin Campbell
Journal: Bioinformatics Date: 2013-10-24 Impact factor: 6.937

8 in total

Review 1. The Necessity of Data Mining in Clinical Emergency Medicine; A Narrative Review of the Current Literatrue.

Authors: Elahe Parva; Reza Boostani; Zahra Ghahramani; Shahram Paydar
Journal: Bull Emerg Trauma Date: 2017-04

2. Data-Driven Tree Transforms and Metrics.

Authors: Gal Mishne; Ronen Talmon; Israel Cohen; Ronald R Coifman; Yuval Kluger
Journal: IEEE Trans Signal Inf Process Netw Date: 2017-08-23

3. A human breast atlas integrating single-cell proteomics and transcriptomics.

Authors: G Kenneth Gray; Carman Man-Chung Li; Jennifer M Rosenbluth; Laura M Selfors; Nomeda Girnius; Jia-Ren Lin; Ron C J Schackmann; Walter L Goh; Kaitlin Moore; Hana K Shapiro; Shaolin Mei; Kurt D'Andrea; Katherine L Nathanson; Peter K Sorger; Sandro Santagata; Aviv Regev; Judy E Garber; Deborah A Dillon; Joan S Brugge
Journal: Dev Cell Date: 2022-05-25 Impact factor: 13.417

4. Life as an early career researcher: interview with Heloisa Helena Milioli.

Authors: Heloisa Helena Milioli
Journal: Future Sci OA Date: 2016-06-06

5. PCA-PAM50 improves consistency between breast cancer intrinsic and clinical subtyping reclassifying a subset of luminal A tumors as luminal B.

Authors: Praveen-Kumar Raj-Kumar; Jianfang Liu; Jeffrey A Hooke; Albert J Kovatich; Leonid Kvecher; Craig D Shriver; Hai Hu
Journal: Sci Rep Date: 2019-05-28 Impact factor: 4.379