
Comparison of unsupervised machine-learning methods to identify metabolomic signatures in patients with localized breast cancer.

Jocelyn Gal1, Caroline Bailleux2, David Chardin3,4, Thierry Pourcher4, Julia Gilhodes5, Lun Jing4, Jean-Marie Guigonis4, Jean-Marc Ferrero2, Gerard Milano6, Baharia Mograbi7, Patrick Brest7, Yann Chateau1, Olivier Humbert3,4, Emmanuel Chamorey1.   

Abstract

Genomics and transcriptomics have led to the widely-used molecular classification of breast cancer (BC). However, heterogeneous biological behaviors persist within breast cancer subtypes. Metabolomics is a rapidly-expanding field of study dedicated to cellular metabolisms affected by the environment. The aim of this study was to compare metabolomic signatures of BC obtained by 5 different unsupervised machine learning (ML) methods. Fifty-two consecutive patients with BC with an indication for adjuvant chemotherapy between 2013 and 2016 were retrospectively included. We performed metabolomic profiling of tumor resection samples using liquid chromatography-mass spectrometry. Here, four hundred and forty-nine identified metabolites were selected for further analysis. Clusters obtained using 5 unsupervised ML methods (PCA k-means, sparse k-means, spectral clustering, SIMLR and k-sparse) were compared in terms of clinical and biological characteristics. With an optimal partitioning parameter k = 3, the five methods identified three prognosis groups of patients (favorable, intermediate, unfavorable) with different clinical and biological profiles. SIMLR and K-sparse methods were the most effective techniques in terms of clustering. In-silico survival analysis revealed a significant difference for 5-year predicted OS between the 3 clusters. Further pathway analysis using the 449 selected metabolites showed significant differences in amino acid and glucose metabolism between BC histologic subtypes. Our results provide proof-of-concept for the use of unsupervised ML metabolomics enabling stratification and personalized management of BC patients. The design of novel computational methods incorporating ML and bioinformatics techniques should make available tools particularly suited to improving the outcome of cancer treatment and reducing cancer-related mortalities.
© 2020 The Authors.


Keywords:  Breast neoplasms; Computer simulation; Metabolomics; Unsupervised machine learning

Year:  2020        PMID: 32637048      PMCID: PMC7327012          DOI: 10.1016/j.csbj.2020.05.021

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   7.271


Introduction

Breast cancer (BC) is the most common type of cancer in women worldwide and the second leading cause of cancer-associated deaths [1]. The treatment strategy may be guided by two classifications indicating the aggressiveness of the tumor. The anatomo-clinical classification is based on age, TNM, histological factors (histological grade, Ki-67) as well as on hormonal-receptor status and Her-2 expression. The molecular classification resulting from genomic [2], transcriptomic [3] and proteomic [4] analyses introduced the concept of luminal A, luminal B, Her-2 and basal-like BC [5], [6], [7]. This latter classification from Perou and Sorlie was assessed using unsupervised analyses [6], [8]. Efforts have been made to develop multivariate prognostic models such as AdjuvantOnline® and the PREDICT tool [9], [10], as well as multigene predictors [11], [12]. The use of biomarker-based tests, including omics-based tests, has steadily increased over the last decade as a result of the need for personalized treatment strategies designed to optimize outcomes [13], [14], [15], [16], [17], [18]. Several genomic prognostic markers have been described for BC, such as OncotypeDX®, Prosigna®, MammaPrint®, Endopredict®, Genomic grade index® and BC Index® [19]. Two markers are commercially available and are increasingly used in clinical practice (the 21-gene recurrence score OncotypeDX® and the 70-gene prognostic signature MammaPrint®). However, heterogeneity persists in biological features within BC subtypes, thus highlighting the need to improve the taxonomy [20]. This heterogeneity may be related to specific combinations of genetic, pathological and environmental factors leading to specific metabolic alterations and interactions [21], [22]. Metabolomics is a new and growing field dedicated to the study of metabolism at a global level that promises to provide new insights into disease mechanisms and drug effects.
Indeed, metabolomics may offer a complementary approach to genomics and could be used to better understand the influence of the environment on tumor phenotype [23]. Two distinct approaches characterize metabolomics: a targeted approach aimed at quantifying as accurately as possible a limited number of predefined metabolites of interest [24] and an untargeted approach aimed at measuring, without any a priori hypothesis, as many metabolites as possible in a sample [25], [26]. As with other omics approaches, metabolomics generates high-dimensional data. These data can be processed with supervised or unsupervised machine learning (ML) algorithms, which are increasingly used for medical diagnosis and therapeutic strategy guidance [27], [28], [29]. Unsupervised ML, in which no a priori class label information is given to guide the algorithm [30], seems a suitable alternative to analyze these data and address the problem of BC heterogeneity [6]. The aim of this study was to compare metabolomic signatures of BC obtained using five different unsupervised ML methods. To evaluate the consistency of our results, the clusters obtained by unsupervised ML methods were compared with patients’ clinical characteristics and identified metabolic pathways.

Material and methods

Patients

This is a retrospective cohort study based on data and samples from 52 patients already available in the Centre Antoine Lacassagne tumor bank and collected during routine practice between 2013 and 2016. Eligible patients had biopsy-proven BC of clinical stage I to IIIB with an indication for post-surgery adjuvant therapy. Tumor phenotypes were classified into three subtypes: triple-negative (estrogen receptor, progesterone receptor and Her-2 non-over-expressed); luminal (estrogen receptor and/or progesterone receptor positive and Her-2 non-over-expressed); Her-2 over-expressed (Her-2 over-expressed, estrogen receptor and progesterone receptor either positive or negative) [31]. After surgery, all patients were treated according to current guidelines, with sequential chemotherapy including an anthracycline-based regimen (epirubicin and cyclophosphamide) and taxanes, followed by radiotherapy. Patients with Her-2 over-expressed tumors received trastuzumab concurrently with taxanes, continued for one year. Patients with luminal BC were then treated with endocrine therapy (tamoxifen or an aromatase inhibitor), based on menopausal status. Clinical, histological, radiological and therapeutic data were retrospectively extracted from our facility’s digital records or collected by a clinical data monitor. Follow-up data were either extracted from our facility’s digital records or retrieved by telephone if patients had changed facilities during surveillance. Written informed consent was obtained from all study participants. All procedures performed in this study involving tissue collection and analyses were in accordance with the ethical standards of the institutional and/or national research committee (French National Commission for Informatics and Liberties N°17003 and National Institute Health data N°1515251018).

Data-preprocessing, metabolite identification, statistical and pathway analysis

Sample collection, preparation and data-processing using MZmine [32], [33] are shown in Supplementary Material S1 and Supplementary Fig. 1. Metabolites obtained from positive and negative ionization modes were combined. Only metabolites with no null values after pre-processing were selected for analysis. When a metabolite was detected in both positive and negative modes, only the mode offering the highest average intensity was considered. After these steps, 1271 metabolites were identified. To eliminate noisy data, a filtering function was applied before statistical analysis. Finally, statistical analysis was performed on 449 metabolites. Metabolic pathways were identified using MetaboAnalyst database sources [34]. The impact score was determined by the relative pathway topological effect of the metabolites, and -log(p) was used as the enrichment score, reflecting the probability of the pathway being identified at random; the number of “hits” was the actual number of matched metabolites in the pathway. To select the most relevant pathways, we applied the following criteria: Impact > 0, FDR < 0.25 and p < 0.05 [35]. A Venn diagram (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to display all possible logical relations between the metabolites or pathways identified by the clustering methods. Differences between clusters regarding the most active metabolites were plotted using boxplots.
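The pathway selection criteria above (Impact > 0, FDR < 0.25 and p < 0.05) can be illustrated with a minimal sketch. The `PathwayResult` record, its field names and the example values are hypothetical and are not MetaboAnalyst output; the use of the natural logarithm for the enrichment score is also an assumption, since the base of -log(p) is not stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class PathwayResult:
    """Hypothetical record for one pathway (field names are illustrative)."""
    name: str
    impact: float   # pathway topology impact score
    p_value: float  # raw enrichment p-value
    fdr: float      # false discovery rate
    hits: int       # number of matched metabolites in the pathway


def enrichment_score(p_value):
    """Return -log(p); natural log is assumed here."""
    return -math.log(p_value)


def select_pathways(results):
    """Keep pathways with Impact > 0, FDR < 0.25 and p < 0.05."""
    return [
        r for r in results
        if r.impact > 0 and r.fdr < 0.25 and r.p_value < 0.05
    ]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    demo = [
        PathwayResult("pathway A", 0.30, 0.010, 0.10, 4),
        PathwayResult("pathway B", 0.00, 0.001, 0.01, 2),  # fails Impact > 0
        PathwayResult("pathway C", 0.20, 0.040, 0.30, 3),  # fails FDR < 0.25
    ]
    for r in select_pathways(demo):
        print(r.name, round(enrichment_score(r.p_value), 2), r.hits)
```

All three conditions must hold together, so a pathway with a very small p-value is still discarded when its topological impact is zero.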

Clustering algorithms

Five unsupervised clustering methods were selected and compared: Principal Component Analysis (PCA) k-means, Sparse k-means, Single-cell Interpretation via Multi-kernel LeaRning (SIMLR), k-sparse and Spectral clustering. Many clustering approaches exist, among which two of the most popular are k-means and spectral clustering [36]. PCA k-means and Sparse k-means are two well-established k-means-based computational methods in frequent use. SIMLR and K-sparse are two recently developed k-means-based methods of particular interest for omics data. These methods combine different dimension reduction steps with k-means. In order to apply these five unsupervised clustering methods, the optimal number of clusters was determined in advance using five criteria: gap [37], silhouette [38], [39], Davies-Bouldin [40], Calinski-Harabasz [41] and the SIMLR method [42]. PCA k-means clustering combines PCA, to reduce the number of dimensions of a dataset, with the k-means method, to minimize the intra-cluster variance for a chosen number of k clusters [43], [44], [45]. Spectral clustering [46], [47] is based on graph theory. It consists of identifying dense regions in a multidimensional dataset, i.e. observations that can form a non-convex set but are close to each other. Sparse k-means clustering was developed in 2010 by Witten and Tibshirani [8]. This method combines a Least Absolute Shrinkage and Selection Operator (LASSO) approach [48] with the k-means method in order to find the clusters and select features simultaneously. SIMLR clustering [42] was developed to analyze scRNA-seq data. This method searches for appropriate cell-to-cell similarity metrics to perform dimension reduction and clustering. In multiple-kernel learning frameworks, this method may be especially beneficial for data containing no identifiable clusters.
K-sparse clustering [49] is an algorithm combining dimension reduction and relevant feature selection, using an L1-norm constraint rather than a lasso-type penalty to select the features. The performance of an unsupervised clustering method is measured by its ability to partition data. Partitioning is considered optimal when it minimizes the average distance between patients within a cluster (homogeneity) and maximizes the pairwise distances between clusters (separability). The performances of the five methods were compared using the silhouette index (SI) [39]. The SI ranges between −1 and 1 and assesses whether a patient belongs to the “right” cluster. The closer the index is to 1, the more satisfactory the assignment of a patient to a cluster. The t-SNE method was used for data visualization [50]. Processing times were obtained on a computer using an i5 processor (3.1 GHz).
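As an illustration of the silhouette index described above, the following sketch computes one value per observation as (b − a) / max(a, b), where a is the mean distance to the other members of the observation's own cluster and b is the smallest mean distance to the members of any other cluster. The Euclidean distance, the function names and the toy coordinates are assumptions made for illustration; this is not the Matlab implementation used in the study.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_values(points, labels):
    """Return one silhouette value per point (requires at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # a: homogeneity term (mean distance within the point's own cluster)
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: separability term (nearest other cluster, by mean distance)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values


if __name__ == "__main__":
    # Toy 2-D data: two well-separated groups of two observations each.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    si = silhouette_values(pts, [0, 0, 1, 1])
    print(sum(si) / len(si))  # average SI for this partition
```

Averaging the per-observation values gives the single summary figure per method that is used to compare partitions.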

Clinical evaluation

The relevance of the discovered clusters was assessed by comparing the clinical and survival characteristics between clusters using χ2 or Fisher’s exact tests for categorical data, analysis of variance or Mann-Whitney’s test for continuous variables and the log-rank test for censored data. Overall survival (OS) was defined as the time between diagnosis and death due to any cause. Specific survival (SS) was defined as the time between diagnosis and death due to BC. Recurrence-Free Survival (RFS) was defined as the time between diagnosis and the first recurrence (local, regional or metastatic). Patients with no event (death or recurrence) or lost to follow-up were censored at the date of their last contact. OS, SS, and RFS were estimated using the Kaplan-Meier method. Median follow-up with a 95% confidence interval was calculated by the reverse Kaplan–Meier method. PCA k-means, Spectral clustering, SIMLR (https://github.com/BatzoglouLabSU/SIMLR/tree/SIMLR/MATLAB) and k-sparse clustering were performed with Matlab® R2018b, and sparse k-means clustering was performed in R [51] using the Sparcl package [52]. The difference between clusters regarding the most biologically significant metabolites was plotted using boxplots. For clinical and biological analyses, all p-values <0.05 (two-sided) were considered statistically significant.
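The Kaplan-Meier estimate mentioned above can be sketched in a few lines. The function name, the input encoding (1 = event, 0 = censored) and the toy follow-up times are assumptions for illustration and do not reproduce the software used in the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate.

    times  : follow-up time for each patient
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


if __name__ == "__main__":
    # Toy data in months; not patient data from the study.
    months = [12, 20, 30, 41, 55]
    status = [1, 0, 1, 0, 1]
    for t, s in kaplan_meier(months, status):
        print(t, round(s, 3))
```

Under the same assumptions, the reverse Kaplan-Meier approach to follow-up amounts to calling the function with the event indicator flipped, for example `kaplan_meier(months, [1 - e for e in status])`, so that censoring is treated as the event of interest.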

Prediction for 5- and 10-year overall and specific survival

The web-based PREDICT prognostication tool (https://breast.predict.nhs.uk/tool) [9], [10], [53] was used to estimate predicted OS (pOS) and predicted SS (pSS) at 5 and 10 years, based on several patient and tumor characteristics. For each patient, ten characteristics were entered manually: age at diagnosis, menopausal status, estrogen receptor status, Her-2 status, Ki-67 status, tumor stage, histological grade, mode of detection, number of positive nodes and presence of micrometastases. The PREDICT tool can be used to estimate expected overall survival at 5 and 10 years when observed survival data are unavailable because of short follow-up. If information was missing for mode of detection, bisphosphonate therapy or menopausal status, patients were not excluded and the “unknown” category was used. Only one patient was excluded because of missing tumor grade data. A bootstrap with 1000 resamples was used to estimate the 95% confidence intervals.
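A bootstrap confidence interval of the kind mentioned above can be sketched as follows. The text does not specify which bootstrap interval was used or which statistic was resampled; the percentile method applied to the mean of per-patient predicted survival is an assumption, and the function name, seed and values are illustrative only.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (assumed method)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample patients with replacement and recompute the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up predicted 5-year survival percentages for one cluster.
    predicted = [77, 71, 59, 80, 65, 90, 55, 72, 68, 84]
    low, high = bootstrap_ci(predicted)
    print(statistics.fmean(predicted), low, high)
```

Resampling at the patient level keeps each patient's characteristics together, which is why the interval widens for small clusters.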

Results

Patient characteristics

Tumor and treatment features of the 52 patients are described in Table 1. Median age was 63 years (range: 37–88). The main histological type was invasive ductal carcinoma (92%), and the main tumor stages were T1 (40.5%) and T2 (46%). Twenty-four patients (46%) presented axillary lymph node invasion. Two patients (4%) were oligometastatic at diagnosis. Forty-three percent of patients had histological grade II tumors and 47% had grade III tumors. About half of the patients had negative hormone receptor status (48%) and 24% of patients had Her-2 over-expression. Median follow-up was 48.5 months (95% CI [43–54.5]). Twenty-one patients presented a recurrence: 4 local recurrences (7.5%), 6 regional recurrences (11.5%) and 11 metastatic recurrences (21%). Three-year OS was 90% [82–99], 3-year SS was 92% [85–100] and 3-year RFS was 82% [72–93] (Supplementary Fig. 2). Median OS, SS, and RFS were not reached.
Table 1

Patients’ demographics and treatment characteristics.

| Clinical characteristic | No. of patients | % |
|---|---|---|
| Age, median (min–max) | 63.2 (37–88) | |
| Histology type | | |
| Invasive ductal carcinoma | 48 | 92 |
| Invasive lobular carcinoma | 3 | 6 |
| Microinvasive carcinoma | 1 | 2 |
| Tumor stage | | |
| T1 | 21 | 40.5 |
| T2 | 24 | 46 |
| T3 | 7 | 13.5 |
| Axillary lymph node status | | |
| N0 | 28 | 54 |
| N+ | 24 | 46 |
| Metastasis | | |
| M0 | 50 | 96 |
| M1 | 2 | 4 |
| Histological grade | | |
| I | 5 | 10 |
| II | 22 | 43 |
| III | 24 | 47 |
| Hormonal receptors status* | | |
| Negative | 25 | 48 |
| Positive | 27 | 52 |
| Her-2 status | | |
| Non-over-expressed | 40 | 74 |
| Over-expressed | 12 | 24 |
| Triple-negative status | | |
| No | 37 | 71 |
| Yes | 15 | 29 |
| Tumor phenotype | | |
| Her2 | 12 | 23 |
| Luminal | 25 | 48 |
| Triple-Negative | 15 | 29 |
| Adjuvant Chemotherapy | | |
| No | 13 | 25 |
| Yes | 39 | 75 |
| Adjuvant Radiotherapy | | |
| No | 9 | 17 |
| Yes | 43 | 83 |
| Adjuvant Hormonotherapy | | |
| No | 24 | 46 |
| Yes | 28 | 54 |

* Oestrogen and/or progesterone.

Clustering results

Estimated number of clusters

Using four methods (Gap statistic, Calinski-Harabasz, Silhouette and SIMLR criterion), the optimal number of clusters was equal to three (k = 3) (Supplementary Fig. 3). Only for Davies-Bouldin criterion, the optimal number of clusters was equal to four (k = 4). It seems reasonable, therefore, to conclude that the optimal number of clusters is equal to 3.

Patient distribution

Three clusters were identified with each of the five clustering methods (Fig. 1). In terms of processing time, PCA k-means was the fastest and K-sparse the slowest (Supplementary Table 1). SIMLR and K-sparse were the most discriminating methods, with average silhouette values of 0.85 and 0.91, respectively (Fig. 2). Seventy-three percent of patients (38/52) were assigned to the same cluster by all five methods, 17.5% of patients (9/52) were assigned to the same cluster by 4 methods and 9.5% of patients (5/52) were assigned to the same cluster by 3 methods.
Fig. 1

Visualization of each cluster by clustering method using t-SNE.

Fig. 2

Silhouette value (SI) representation for each patient by clustering method.
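The agreement figures reported for the patient distribution (patients placed in the same cluster by 5, 4 or 3 methods) correspond to a per-patient majority count. A minimal sketch is given below; it assumes that cluster labels have already been matched across methods so that a given label denotes the same group everywhere, a step that is not described in the text. The function name and the toy assignments are hypothetical.

```python
from collections import Counter


def agreement_profile(assignments):
    """Count patients by the size of their largest agreeing group of methods.

    assignments: dict mapping method name -> list of cluster labels,
                 one label per patient, in the same patient order.
    Returns a Counter {number of agreeing methods: number of patients}.
    """
    label_lists = list(assignments.values())
    n_patients = len(label_lists[0])
    if any(len(labels) != n_patients for labels in label_lists):
        raise ValueError("all methods must label the same patients")

    profile = Counter()
    for i in range(n_patients):
        votes = Counter(labels[i] for labels in label_lists)
        # Size of the largest group of methods agreeing on this patient.
        profile[votes.most_common(1)[0][1]] += 1
    return profile


if __name__ == "__main__":
    # Toy assignments for 4 patients; not data from the study.
    toy = {
        "PCA k-means": [1, 2, 3, 1],
        "Spectral": [1, 2, 3, 2],
        "Sparse k-means": [1, 2, 2, 3],
        "SIMLR": [1, 2, 3, 1],
        "K-sparse": [1, 1, 3, 1],
    }
    print(dict(agreement_profile(toy)))
```

Dividing each count by the number of patients gives the percentages in the form reported above.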


Comparison of clinical characteristics between clusters

As shown in Table 2, the 5 methods revealed significant inter-cluster differences. Patients in cluster 3 had mainly unfavorable prognostic factors: tumor stage T2/T3, histological grade III, high mitotic score and triple-negative phenotype. In contrast, patients in cluster 1 had mainly favorable prognostic factors: tumor stage T1, histological grade I/II, lower mitotic score and luminal phenotype, whereas patients in cluster 2 constituted an intermediate group presenting both good and poor prognostic factors. Clusters defined by PCA k-means were significantly different for 5 characteristics: tumor stage, mitosis, tumor phenotype, Her-2 status and luminal. Clusters defined by Spectral Clustering were significantly different for 6 characteristics: tumor stage, histological grade, mitosis, Ki67, tumor phenotype and luminal. Clusters defined by Sparse k-means were significantly different for 4 characteristics: histological grade, tumor phenotype, Her-2 status and luminal. Clusters defined by SIMLR were significantly different for 6 characteristics: tumor stage, histological grade, mitosis, Ki67, tumor phenotype and luminal. Clusters defined by K-Sparse were significantly different for 6 characteristics: tumor stage, histological grade, mitosis, Ki67, tumor phenotype and luminal. From a strictly clinical point of view, Spectral clustering, SIMLR and K-sparse are the 3 most discriminating methods. Indeed, for these 3 methods, six prognostic factors (tumor stage, histological grade, mitosis score, Ki-67, tumor phenotype and luminal) were distributed significantly differently between the 3 clusters.
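Table 2

Clinical comparison of 52 patients between clusters.

| Clinical characteristic | PCA-K-means C1 (N = 21) | PCA-K-means C2 (N = 10) | PCA-K-means C3 (N = 21) | PCA-K-means P-value | Spectral Clustering C2 (N = 19) | Spectral Clustering C1 (N = 12) | Spectral Clustering C3 (N = 21) | Spectral Clustering P-value | Sparse K-means C1 (N = 24) | Sparse K-means C2 (N = 8) | Sparse K-means C3 (N = 20) | Sparse K-means P-value | SIMLR C1 (N = 17) | SIMLR C2 (N = 12) | SIMLR C3 (N = 23) | SIMLR P-value | K-Sparse C1 (N = 19) | K-Sparse C2 (N = 12) | K-Sparse C3 (N = 21) | K-Sparse P-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age a | 62.7 (15.2) | 64.8 (16) | 62.9 (15) | 0.93 | 64.8 (14.3) | 62.5 (16.5) | 62 (15.3) | 0.8 | 64.1 (15) | 60.5 (17.2) | 63 (14.9) | 0.85 | 64.3 (14.1) | 64.9 (16.1) | 61.4 (15.6) | 0.755 | 64.8 (14.3) | 62.5 (16.5) | 62 (15.3) | 0.827 |
| Histology type | | | | 1 | | | | 0.392 | | | | 0.106 | | | | 0.752 | | | | 0.392 |
| Ductal carcinoma | 19 (90.5) | 10 (100) | 19 (90.5) | | 17 (89.5) | 11 (91.7) | 20 (95.2) | | 21 (87.5) | 7 (87.5) | 20 (100) | | 15 (88.2) | 12 (100) | 21 (91.3) | | 17 (89.5) | 11 (91.7) | 20 (95.2) | |
| Lobular carcinoma | 2 (9.5) | 0 (0) | 1 (4.8) | | 2 (10.5) | 1 (8.3) | 0 (0) | | 3 (12.5) | 0 (0) | 0 (0) | | 2 (11.8) | 0 (0) | 1 (4.3) | | 2 (10.5) | 1 (8.3) | 0 (0) | |
| Microinvasive carcinoma | 0 (0) | 0 (0) | 1 (4.8) | | 0 (0) | 0 (0) | 1 (4.8) | | 0 (0) | 1 (12.5) | 0 (0) | | 0 (0) | 0 (0) | 1 (4.3) | | 0 (0) | 0 (0) | 1 (4.8) | |
| Tumor stage | | | | 0.005 | | | | 0.018 | | | | 0.063 | | | | 0.045 | | | | 0.018 |
| T1 | 14 (66.7) | 3 (30) | 4 (19) | | 12 (63.2) | 5 (41.7) | 4 (19) | | 14 (58.3) | 2 (25) | 5 (25) | | 10 (58.8) | 6 (50) | 5 (21.7) | | 12 (63.2) | 5 (41.7) | 4 (19) | |
| T2/T3 | 7 (33.3) | 7 (70) | 17 (81) | | 7 (36.8) | 7 (58.3) | 17 (81) | | 10 (41.7) | 6 (75) | 15 (75) | | 7 (41.2) | 6 (50) | 18 (78.3) | | 7 (36.8) | 7 (58.3) | 17 (81) | |
| Axillary lymph node | | | | 0.162 | | | | 0.075 | | | | 0.526 | | | | 0.387 | | | | 0.075 |
| N0 | 14 (66.7) | 6 (60) | 8 (38.1) | | 14 (73.7) | 6 (50) | 8 (38.1) | | 15 (62.5) | 4 (50) | 9 (45) | | 11 (64.7) | 7 (58.3) | 10 (43.5) | | 14 (73.7) | 6 (50) | 8 (38.1) | |
| N+ | 7 (33.3) | 4 (40) | 13 (61.9) | | 5 (26.3) | 6 (50) | 13 (61.9) | | 9 (37.5) | 4 (50) | 11 (55) | | 6 (35.3) | 5 (41.7) | 13 (56.5) | | 5 (26.3) | 6 (50) | 13 (61.9) | |
| Metastasis | | | | 0.667 | | | | 1 | | | | 1 | | | | 0.497 | | | | 1 |
| M0 | 21 (100) | 10 (100) | 19 (90.5) | | 18 (94.7) | 12 (100) | 20 (95.2) | | 23 (96) | 8 (100) | 19 (95) | | 17 (100) | 12 (100) | 21 (86.9) | | 18 (94.7) | 12 (100) | 20 (95.2) | |
| M1 | 0 (40) | 0 (0) | 2 (9.5) | | 1 (5.3) | 0 (0) | 1 (4.8) | | 1 (4) | 0 (0) | 1 (5) | | 0 (0) | 0 (0) | 2 (13.1) | | 1 (5.3) | 0 (0) | 1 (50) | |
| Histological grade | | | | 0.109 | | | | 0.025 | | | | 0.008 | | | | 0.007 | | | | 0.025 |
| I/II | 13 (61.9) | 7 (70) | 7 (35) | | 12 (63.2) | 9 (75) | 6 (30) | | 15 (62.5) | 5 (71.4) | 7 (35) | | 11 (64.7) | 9 (75) | 7 (31.8) | | 12 (63.2) | 9 (75) | 6 (30) | |
| III | 8 (38.1) | 3 (30) | 13 (75) | | 7 (36.8) | 3 (25) | 14 (70) | | 9 (37.5) | 2 (28.6) | 13 (65) | | 6 (35.3) | 3 (25) | 15 (68.2) | | 7 (36.8) | 3 (25) | 14 (70) | |
| Mitosis | | | | 0.024 | | | | 0.016 | | | | 0.133 | | | | 0.005 | | | | 0.016 |
| 1 | 11 (52.4) | 4 (40) | 2 (10) | | 10 (52.6) | 5 (41.7) | 2 (10) | | 11 (45.8) | 2 (28.6) | 4 (20) | | 10 (58.8) | 5 (41.7) | 2 (9.1) | | 10 (52.6) | 5 (41.7) | 2 (10) | |
| 2 | 3 (14.3) | 4 (40) | 7 (35) | | 3 (15.8) | 5 (41.7) | 6 (30) | | 4 (16.7) | 4 (57.1) | 6 (30) | | 2 (11.8) | 5 (41.7) | 7 (31.8) | | 3 (15.8) | 5 (41.7) | 6 (30) | |
| 3 | 7 (33.3) | 2 (20) | 11 (55) | | 6 (31.6) | 2 (16.7) | 10 (60) | | 9 (37.5) | 1 (14.3) | 10 (50) | | 5 (29.4) | 2 (16.7) | 13 (59.1) | | 6 (31.6) | 2 (16.7) | 12 (60) | |
| Ki67 a | 25 (5, 100) | 27.5 (10, 90) | 60 (10, 90) | 0.066 | 41.1 (30.6) | 33 (22.6) | 58.8 (27.2) | 0.027 | 30 (19.2, 80) | 35 (23.8, 45) | 60 (28.8, 90) | 0.196 | 38 (31) | 32.8 (22.7) | 59.7 (25.9) | 0.009 | 41.1 (30.6) | 33 (22.6) | 58.8 (27.2) | 0.027 |
| Tumour phenotype | | | | 0.024 | | | | 0.012 | | | | 0.006 | | | | 0.018 | | | | 0.012 |
| Her-2 over-expressed | 1 (4.8) | 4 (40) | 7 (33.3) | | 1 (5.3) | 4 (33.3) | 7 (33.3) | | 2 (8.3) | 4 (50) | 6 (30) | | 1 (5.9) | 4 (33.3) | 7 (30.4) | | 1 (5.3) | 4 (33.3) | 7 (33.3) | |
| Luminal | 14 (66.7) | 5 (50) | 6 (28.6) | | 13 (68.4) | 7 (58.3) | 5 (23.8) | | 16 (66.7) | 4 (50) | 5 (25) | | 12 (70.6) | 7 (58.3) | 6 (26.1) | | 13 (68.4) | 7 (58.3) | 5 (23.8) | |
| Triple-Negative | 6 (28.6) | 1 (10) | 8 (38.1) | | 5 (26.3) | 1 (8.3) | 9 (42.9) | | 6 (25) | 0 (0) | 9 (45) | | 4 (23.5) | 1 (8.3) | 10 (43.5) | | 5 (26.3) | 1 (8.3) | 9 (42.9) | |
| Hormonal receptors status | | | | 0.178 | | | | 0.075 | | | | 0.112 | | | | 0.071 | | | | 0.075 |
| Negative | 7 (33.3) | 5 (50) | 13 (61.9) | | 6 (31.6) | 5 (41.7) | 14 (66.7) | | 8 (33.3) | 4 (50) | 13 (65) | | 5 (29.4) | 5 (41.7) | 15 (65.2) | | 6 (31.6) | 5 (41.7) | 14 (66.7) | |
| Positive | 14 (66.7) | 5 (50) | 7 (38.1) | | 13 (68.4) | 7 (58.3) | 7 (33.3) | | 16 (66.7) | 4 (50) | 7 (35) | | 12 (70.6) | 7 (58.3) | 8 (34.8) | | 13 (68.4) | 7 (58.3) | 7 (33.3) | |
| Her-2 status | | | | 0.028 | | | | 0.061 | | | | 0.031 | | | | 0.115 | | | | 0.061 |
| Non-over-expressed | 20 (95.2) | 6 (60) | 13 (66.7) | | 18 (94.7) | 8 (66.7) | 14 (66.7) | | 22 (91.7) | 4 (50) | 14 (70) | | 16 (94.1) | 8 (66.7) | 16 (69.6) | | 18 (94.7) | 6 (66.7) | 14 (66.7) | |
| Over-expressed | 1 (4.8) | 5 (40) | 6 (33.3) | | 1 (5.3) | 4 (33.3) | 7 (33.3) | | 2 (8.3) | 4 (50) | 6 (30) | | 1 (5.9) | 4 (33.3) | 7 (30.4) | | 1 (5.3) | 4 (33.3) | 7 (33.3) | |
| Triple-Negative status | | | | 0.272 | | | | 0.104 | | | | 0.051 | | | | 0.087 | | | | 0.104 |
| No | 15 (71.4) | 9 (90) | 13 (61.9) | | 14 (73.7) | 11 (91.7) | 12 (57.1) | | 18 (75) | 8 (100) | 11 (55) | | 13 (76.5) | 11 (91.7) | 13 (56.5) | | 14 (73.7) | 11 (91.7) | 12 (57.1) | |
| Yes | 6 (28.6) | 1 (10) | 8 (38.1) | | 5 (26.3) | 1 (8.3) | 9 (42.9) | | 6 (25) | 0 (0) | 9 (45) | | 4 (23.5) | 1 (8.3) | 10 (43.5) | | 5 (26.3) | 1 (8.3) | 9 (42.9) | |
| Luminal | | | | 0.047 | | | | 0.014 | | | | 0.018 | | | | 0.015 | | | | 0.014 |
| No | 7 (33.3) | 5 (50) | 15 (71.4) | | 6 (31.6) | 5 (41.7) | 16 (76.2) | | 8 (33.3) | 4 (50) | 15 (75) | | 5 (29.4) | 5 (41.7) | 17 (73.9) | | 6 (31.6) | 5 (41.7) | 16 (76.2) | |
| Yes | 14 (66.7) | 5 (50) | 6 (28.6) | | 13 (68.4) | 7 (58.3) | 5 (23.8) | | 16 (66.7) | 4 (50) | 5 (25) | | 12 (70.6) | 7 (58.3) | 6 (26.1) | | 13 (68.4) | 7 (58.3) | 5 (28.8) | |
| Adjuvant Chemotherapy | | | | 0.52 | | | | 0.423 | | | | 0.459 | | | | 0.459 | | | | 0.423 |
| No | 7 (33.3) | 3 (30) | 4 (19) | | 7 (36.8) | 2 (16.7) | 4 (19) | | 6 (25) | 2 (25) | 5 (25) | | 6 (35.3) | 3 (25) | 4 (17.4) | | 7 (36.8) | 2 (16.7) | 4 (19) | |
| Yes | 14 (85.7) | 7 (70) | 17 (81) | | 12 (63.2) | 10 (83.3) | 17 (81) | | 18 (75) | 6 (75) | 15 (75) | | 11 (64.7) | 9 (75) | 19 (82.6) | | 12 (63.2) | 10 (83.3) | 17 (81) | |
| Adjuvant Radiotherapy | | | | 0.561 | | | | 0.803 | | | | 0.69 | | | | 1 | | | | 0.803 |
| No | 3 (14.3) | 3 (30) | 3 (14.3) | | 3 (15.8) | 3 (25) | 3 (14.3) | | 3 (12.5) | 2 (25) | 4 (20) | | 3 (17.6) | 2 (16.7) | 4 (17.4) | | 3 (15.8) | 3 (25) | 3 (14.3) | |
| Yes | 18 (85.7) | 7 (70) | 18 (85.7) | | 16 (84.2) | 9 (75) | 18 (85.7) | | 21 (87.5) | 6 (75) | 16 (80) | | 14 (82.4) | 10 (83.3) | 19 (82.6) | | 16 (84.2) | 9 (75) | 18 (85.7) | |

C1: cluster 1; C2: cluster 2; C3: cluster 3; a: mean (sd) or median (min, max).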

Comparison of survival and predicted survival between clusters

None of the methods created clusters showing significant differences for OS, SS or RFS. The analysis of patients’ simulated survival data using the PREDICT tool is presented in Table 3 and shows a predicted survival gradient across the clusters obtained with the 5 methods for both OS and SS. There were significant differences for 5-year pOS between clusters obtained with K-sparse (p = 0.021), Sparse K-means (p = 0.049) and Spectral clustering (p = 0.021). The five methods showed a significant difference for 5-year pSS between clusters. In terms of 10-year pOS, there were no significant differences between clusters obtained by any of the 5 methods. In contrast, for 10-year pSS, the 5 methods showed significant differences between clusters. Patients in cluster 3 clearly showed the poorest predicted survival.
Table 3

Comparison of prediction for overall and specific survival between clusters at 5 and 10-year.


Predict 5-year
Predict 10-year
Overall Survival
Specific Survival
Overall Survival
Specific Survival
MethodsNo. of patients% [95% CI]P-value% [95% CI]P-value% [95% CI]P-value% [95% CI]P-value
K-sparse0.0210.0020.0770.004
Cluster 1 (n = 19)77% [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]87% [80], [81], [82], [83], [84], [85], [86], [87], [88], [89], [90], [91]58% [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65]80% [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86]
Cluster 2 (n = 12)71% [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]81% [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89], [90]53% [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66]75% [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85]
Cluster 3 (n = 20)59% [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69]68% [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74]41% [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52]62% [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69]
SIMLR0.10.0110.2410.009
Cluster 1 (n = 17)75% [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]85% [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89], [90], [91]55% [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64]77% [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84]
Cluster 2 (n = 12)72% [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]83% [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89], [90], [91]55% [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67]79% [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87]
Cluster 3 (n = 22)61% [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70]71% [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77]43% [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53]64% [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70]
Sparse K-means0.0490.0270.2030.024
Cluster 1 (n = 24)74% [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80]84% [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89]54% [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63]80% [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86]
Cluster 2 (n = 7)72% [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87]83% [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89], [90], [91], [92], [93], [94]56% [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72]75% [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85]
Cluster 3 (n = 20)61% [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69]70% [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78]42% [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52]62% [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69]
Spectral clustering | 0.021 | 0.002 | 0.077 | 0.004
Cluster 1 (n = 19) | 77% [68–83] | 77% [80–91] | 58% [48–65] | 82% [73–86]
Cluster 2 (n = 12) | 71% [57–81] | 71% [69–90] | 52% [32–64] | 75% [60–85]
Cluster 3 (n = 20) | 59% [47–68] | 69% [60–76] | 41% [29–52] | 62% [53–69]
PCA K-means | 0.055 | 0.009 | 0.085 | 0.008
Cluster 1 (n = 21) | 77% [67–81] | 86% [79–91] | 58% [48–65] | 79% [71–85]
Cluster 2 (n = 10) | 69% [53–81] | 80% [66–90] | 52% [32–64] | 77% [63–86]
Cluster 3 (n = 20) | 60% [47–69] | 69% [61–78] | 41% [29–52] | 63% [54–70]
Comparison of predicted overall and specific survival between clusters at 5 and 10 years.

Comparison of the most impactful metabolites according to the five methods

To relate the 449 metabolites to cluster construction, we ranked the metabolites extracted by each of the five methods according to their contribution to the output. This ranking allowed us to classify the relative impact of each metabolite on cluster construction and on the identification of metabolic signatures. The highest-ranked metabolites were those that provided relevant information to the signature, as opposed to those that provided redundant information or none. Of the 449 metabolites, 116 (26%) were selected by K-sparse clustering and 69 (15%) by Sparse K-means clustering. For the three other methods, which do not perform sparse feature selection, all 449 metabolites were retained. The 50 most influential metabolites identified by each of the five methods are presented in Supplementary Table 2, and the top-50 lists of the 5 methods are compared in a Venn diagram (Fig. 3). Two metabolites were shared by all 5 methods (Creatine, l-Proline), 9 were shared by 4 methods (Betaine, Glutathione, Humulinic acid A, Isoleucyl-Methionine, l-Carnitine, l-Methionine, l-Phenylalanine, Triethanolamine, Alnustone), 28 were shared by 3 methods and 38 were shared by 2 methods (Table 4).
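The comparison of top-ranked lists described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `group_by_method_combination` and `count_by_number_of_methods` are hypothetical, and the toy lists contain a handful of metabolites taken from Table 4 rather than the real top-50 lists.

```python
from collections import defaultdict


def group_by_method_combination(top_lists):
    """Group metabolites by the exact set of methods whose top list contains them."""
    membership = defaultdict(set)
    for method, metabolites in top_lists.items():
        for metabolite in metabolites:
            membership[metabolite].add(method)
    groups = defaultdict(list)
    for metabolite, methods in membership.items():
        groups[frozenset(methods)].append(metabolite)
    return {combo: sorted(names) for combo, names in groups.items()}


def count_by_number_of_methods(groups):
    """Count how many metabolites are shared by exactly 1, 2, ... methods."""
    counts = defaultdict(int)
    for combo, names in groups.items():
        counts[len(combo)] += len(names)
    return dict(counts)


# Toy top lists (illustrative subset only, NOT the real top-50 lists).
toy_top_lists = {
    "K-Sparse": {"Creatine", "l-Proline", "Betaine", "Histamine"},
    "PCA K-means": {"Creatine", "l-Proline", "Betaine", "Pipecolic acid"},
    "SIMLR": {"Creatine", "l-Proline", "Cyclic AMP"},
    "Sparse K-means": {"Creatine", "l-Proline", "Betaine", "Ergothioneine"},
    "Spectral clustering": {"Creatine", "l-Proline", "Betaine", "Acrylamide"},
}

groups = group_by_method_combination(toy_top_lists)
shared_counts = count_by_number_of_methods(groups)
for combo, names in sorted(groups.items(), key=lambda kv: -len(kv[0])):
    print(len(combo), sorted(combo), names)
```

Each key of `groups` corresponds to one region of the Venn diagram, and `shared_counts` gives the per-level totals of the kind reported in the text (shared by 5, 4, 3, 2 methods, or unique to one).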
Fig. 3

Venn diagram of metabolites that were in common or unique to the five clustering methods.

Table 4

Table indicating which metabolites are in each intersection or are unique to a certain list.

No. of methods | Clustering methods | Nbr | Metabolites
5 | K-Sparse; PCA K-means; SIMLR; Sparse K-means; Spectral clustering | 2 | Creatine; l-Proline
4 | K-Sparse; SIMLR; Sparse K-means; Spectral clustering | 1 | Triethanolamine
4 | K-Sparse; PCA K-means; SIMLR; Sparse K-means | 2 | l-Methionine; l-Phenylalanine
4 | K-Sparse; PCA K-means; Sparse K-means; Spectral clustering | 2 | l-Carnitine; Betaine
4 | PCA K-means; SIMLR; Sparse K-means; Spectral clustering | 4 | Glutathione; Isoleucyl-Methionine; Humulinic acid A; Alnustone
3 | K-Sparse; SIMLR; Sparse K-means | 1 | Hydroxyprolyl-Valine
3 | K-Sparse; PCA K-means; Sparse K-means | 20 | Aminoadipic acid; Methylmalonic acid; 1b-Furanoeudesm-4(15)-en-1-ol acetate; Glycerophosphocholine; Lidocaine; Adenosine monophosphate; 2-Methyl-3-ketovaleric acid; Liqcoumarin; p-Cresol sulfate; 2-Methylbutyroylcarnitine; Methoxsalen; Citramalic acid; Hypoxanthine; l-Acetylcarnitine; Ethyl aconitate; Guanine; l-Glutamic acid; Uridine 5′-monophosphate; N1,N12-Diacetylspermine; 5-Aminoimidazole ribonucleotide
3 | SIMLR; Sparse K-means; Spectral clustering | 4 | 2,5-Dichloro-4-oxohex-2-enedioate; Histidinyl-Isoleucine; 3-(4-Methyl-3-pentenyl)thiophene; (−)-Epigallocatechin
3 | PCA K-means; Sparse K-means; Spectral clustering | 3 | l-Isoleucine; Ascorbic acid; Neurine
2 | K-Sparse; Sparse K-means | 3 | 5-Hydroxyisourate; Hexanoylcarnitine; l-Glutamine
2 | K-Sparse; PCA K-means | 9 | Creatinine; Proline betaine; Erythronic acid; Garcinia acid; Thiolutin; 4-Chloro-1H-indole-3-acetic acid; Niacinamide; 3-Dehydroxycarnitine; Dihydrothymine
2 | SIMLR; Spectral clustering | 21 | 5b-Cyprinol sulfate; 2′,4-Dihydroxy-4′,6′-dimethoxychalcone; Propenoylcarnitine; 5-Hydroxyindoleacetic acid; Phaseolic acid; Lisuride; 2-Bromophenol; (alpha-D-mannosyl)7-beta-D-mannosyl-diacetylchitobiosyl-l-asparagine isoform B (protein); Plastoquinone 3; 2,2,4,4,-Tetramethyl-6-(1-oxopropyl)-1,3,5-cyclohexanetrione; 1-Pyrroline; Gingerol; Prehumulinic acid; 1-Methylpyrrolo[1,2-a]pyrazine; 5-(methylthio)-2,3-Dioxopentyl phosphate; Propionic acid; Isosakuranin; Phenmetrazine; Methionine sulfoxide; Glycerol; Carboxyphosphamide
2 | SIMLR; Sparse K-means | 1 | Phosphoric acid
2 | PCA K-means; Sparse K-means | 4 | I(−); l-Tyrosine; Gravelliferone; Valganciclovir
1 | K-Sparse | 10 | Prolylhydroxyproline; Guanidoacetic acid; Histamine; PC-M6; l-Histidine; N-Acetyl-l-aspartic acid; 3-Mercaptohexyl hexanoate; Trimethylamine N-oxide; Pantothenic acid; Flunitrazepam
1 | SIMLR | 14 | 3-Hydroxy-6,8-dimethoxy-7(11)-eremophilen-12,8-olide; Glycerol tripropanoate; Alanyl-Isoleucine; 1-(2,4,6-Trimethoxyphenyl)-1,3-butanedione; 1-Oxo-1H-2-benzopyran-3-carboxaldehyde; 1,3,11-Tridecatriene-5,7,9-triyne; N-Acetyl-l-methionine; 3-Methyl sulfolene; 5-(4-Acetoxy-3-oxo-1-butynyl)-2,2′-bithiophene; Ac-Ser-Asp-Lys-Pro-OH; Cyclic AMP; Benzothiazole; (±)-2-Methylthiazolidine; 2-Methylcitric acid
1 | Spectral clustering | 13 | 2,3-diketogulonate; 2,5-Furandicarboxylic acid; Pyrrolidine; Piperidine; Beta-Alanine; Aspartyl-l-proline; Erythro-5-hydroxy-l-lysinium(1+); Acrylamide; 5-Hydroxylysine; S-Nitrosoglutathione; 2,2-dichloro-1,1-ethanediol; Valerenic acid; Dichloromethane
1 | Sparse K-means | 3 | Erinapyrone C; Ergothioneine; N-Methylethanolaminium phosphate
1 | PCA K-means | 4 | Dimethylglycine; Pipecolic acid; Methyl (9Z)-10′-oxo-6,10′-diapo-6-carotenoate; N-Desmethylvenlafaxine

Comparison of the metabolic pathways identified by the 5 methods

For a better understanding of metabolic dysregulation among BC subtypes, pathway analysis was performed. All the metabolic pathways highlighted by each of the 5 methods are shown in Supplementary Table 3, and the most relevant pathways for each method are shown in Table 5. Sparse K-means identified only one statistically significant pathway, “cysteine and methionine metabolism”, involved in amino acid metabolism. K-Sparse identified 3 different pathways: “glycerolipid metabolism”, “Starch and sucrose metabolism” (involved in carbohydrate metabolism) and “Aminoacyl-tRNA biosynthesis” (involved in translation). Spectral clustering identified 17 pathways, the 3 most important being “Glycine, serine and threonine metabolism”, “Alanine, aspartate and glutamate metabolism” and “Histidine metabolism and glutathione metabolism”, all involved in amino acid metabolism. PCA K-means identified 10 pathways, the 3 most important being “Alanine, aspartate and glutamate metabolism” (amino acid metabolism), “Pyruvate metabolism” (carbohydrate metabolism/glucose oxidation) and “Citrate cycle (TCA cycle)” (energy metabolism).
Table 5

List of significant relevant pathways identified by 5 methods.

K-Sparse method
Clusters comparison | Interaction metabolite | Pathway name | Total Cmpd (a) | Match Status (b) | Raw p (c) | -log(p) | Impact (d)
C1 vs C3 | UDP-glucose | Starch and sucrose metabolism | 50 | 1 | 0.0107 | 4.5388 | 0.1390
C1 vs C3 | UDP-glucose | Amino sugar and nucleotide sugar metabolism | 88 | 1 | 0.0107 | 4.5388 | 0.0928
C1 vs C3 | UDP-glucose; Glyceric acid | Glycerolipid metabolism | 32 | 2 | 0.0153 | 4.1831 | 0.0206



SIMLR method
Clusters comparison | Interaction metabolite | Pathway name | Total Cmpd | Match Status | P value | -log(p) | Impact
C1 vs C2 | Glutathione; Oxidized glutathione; Glycine; l-Glutamic acid; Pyroglutamic acid; Spermidine; Ornithine; Putrescine; Spermine; Cadaverine; Aminopropylcadaverine; Ascorbic acid | Glutathione metabolism | 38 | 12 | 0 | 12.826 | 0.3628
C1 vs C2 | Ascorbic acid; Uridine diphosphate glucose; Pyruvic acid; D-Glucuronic acid 1-phosphate; Oxoglutaric acid | Ascorbate and aldarate metabolism | 45 | 5 | 0 | 12.469 | 0.1383
C1 vs C2 | l-Tryptophan; N-Acetylserotonin; 5-Hydroxyindoleacetic acid; 2-Aminomuconic acid semialdehyde; 3-Hydroxyanthranilic acid; l-Kynurenine; Acetyl-N-formyl-5-methoxykynurenamine; Isophenoxazine | Tryptophan metabolism | 79 | 8 | 0.0001 | 9.1233 | 0.2741
C1 vs C2 | 5′-Methylthioadenosine; N-Formyl-l-methionine; l-Homocysteine; l-Methionine; Glutathione; Phosphoserine; 3-Sulfinoalanine; l-Aspartyl-4-phosphate; Pyruvic acid | Cysteine and methionine metabolism | 56 | 9 | 0.0008 | 7.1674 | 0.2509
C1 vs C2 | l-Glutamine; Phosphoribosylformylglycineamidine; Cyclic AMP; Adenosine monophosphate; Adenosine; Inosine; Adenine; Hypoxanthine; Guanine; Uric acid; 5-Hydroxyisourate; Guanosine; Adenosine diphosphate ribose; 5-Aminoimidazole ribonucleotide; Glyoxylic acid; Glycine; Adenosine 3′,5′-diphosphate | Purine metabolism | 92 | 17 | 0.0011 | 6.8091 | 0.2048
C1 vs C2 | Glyoxylic acid; Oxoglutaric acid; N-Formyl-l-methionine; Glycolic acid; Glyceric acid; Pyruvic acid | Glyoxylate and dicarboxylate metabolism | 50 | 6 | 0.0027 | 5.9281 | 0.268
C1 vs C2 | l-Glutamine; Ornithine; Citrulline; l-Arginine; l-Glutamic acid; N-Acetylornithine; l-Proline; Hydroxyproline; Guanidoacetic acid; Creatine; 4-Guanidinobutanoic acid; N2-Succinyl-l-ornithine; Putrescine; Spermidine; N-Acetylputrescine; Pyruvic acid; Glyoxylic acid; Spermine | Arginine and proline metabolism | 77 | 19 | 0.0053 | 5.238 | 0.6514
C1 vs C2 | Oxoglutaric acid; Oxalosuccinic acid; Pyruvic acid | Citrate cycle (TCA cycle) | 20 | 3 | 0.0075 | 4.8991 | 0.176
C1 vs C2 | D-Xylose; Uridine diphosphate glucose; D-Glucuronic acid 1-phosphate; Pyruvic acid | Pentose and glucuronate interconversions | 53 | 4 | 0.0076 | 4.8821 | 0.0394
C1 vs C2 | 2-Hydroxyethanesulfonate; Pyruvic acid; 3-Sulfinoalanine | Taurine and hypotaurine metabolism | 20 | 3 | 0.0154 | 4.1754 | 0.0324
C1 vs C2 | Glyceric acid; Betaine; Guanidoacetic acid; Dimethylglycine; Glycine; Phosphoserine; l-Threonine; O-Phosphohomoserine; l-Aspartyl-4-phosphate; Creatine; Glyoxylic acid; Pyruvic acid; l-Tryptophan | Glycine, serine and threonine metabolism | 48 | 13 | 0.018 | 4.0154 | 0.46986
C1 vs C2 | Uridine diphosphate glucose; D-Glucuronic acid 1-phosphate; N-Acetyl-D-Glucosamine 6-Phosphate; Uridine diphosphate-N-acetylglucosamine; Cytidine monophosphate N-acetylneuraminic acid; D-Glucose; D-Xylose | Amino sugar and nucleotide sugar metabolism | 88 | 7 | 0.0187 | 3.9783 | 0.1417
C1 vs C2 | Formiminoglutamic acid; l-Glutamic acid; Urocanic acid; l-Histidine; Histamine; D-Erythro-imidazole-glycerol-phosphate; Ergothioneine; Hydantoin-5-propionic acid; Imidazole acetol-phosphate; Oxoglutaric acid | Histidine metabolism | 44 | 10 | 0.0412 | 3.1903 | 0.3705
C1 vs C2 | Pyridoxamine; Oxoglutaric acid; 3-Hydroxy-2-methylpyridine-4,5-dicarboxylate; Pyruvic acid | Vitamin B6 metabolism | 32 | 4 | 0.0412 | 3.1898 | 0.0773
C1 vs C3 | Formiminoglutamic acid; l-Glutamic acid; Urocanic acid; l-Histidine; Histamine; D-Erythro-imidazole-glycerol-phosphate; Ergothioneine; Hydantoin-5-propionic acid; Imidazole acetol-phosphate; Oxoglutaric acid | Histidine metabolism | 44 | 10 | 0.0139 | 4.2752 | 0.3705
C1 vs C3 | Phenylpyruvic acid; l-Phenylalanine; l-Tyrosine; 3-Dehydroquinate; l-Tryptophan | Phenylalanine, tyrosine and tryptophan biosynthesis | 27 | 5 | 0.0189 | 3.9687 | 0.099
C1 vs C3 | l-Tryptophan; N-Acetylserotonin; 5-Hydroxyindoleacetic acid; 2-Aminomuconic acid semialdehyde; 3-Hydroxyanthranilic acid; l-Kynurenine; Acetyl-N-formyl-5-methoxykynurenamine; Isophenoxazine | Tryptophan metabolism | 79 | 8 | 0 | 16.409 | 0.2741
C2 vs C3 | Glutathione; Oxidized glutathione; Glycine; l-Glutamic acid; Pyroglutamic acid; Spermidine; Ornithine; Putrescine; Spermine; Cadaverine; Aminopropylcadaverine; Ascorbic acid | Glutathione metabolism | 38 | 12 | 0 | 16.133 | 0.3628
C2 vs C3 | Ascorbic acid; Uridine diphosphate glucose; Pyruvic acid; D-Glucuronic acid 1-phosphate; Oxoglutaric acid | Ascorbate and aldarate metabolism | 45 | 5 | 0 | 13.096 | 0.1383
C2 vs C3 | 5′-Methylthioadenosine; N-Formyl-l-methionine; l-Homocysteine; l-Methionine; Glutathione; Phosphoserine; 3-Sulfinoalanine; l-Aspartyl-4-phosphate; Pyruvic acid | Cysteine and methionine metabolism | 56 | 9 | 0.0001 | 9.8548 | 0.2509
C2 vs C3 | Phenylpyruvic acid; l-Phenylalanine; l-Tyrosine; 3-Dehydroquinate; l-Tryptophan | Phenylalanine, tyrosine and tryptophan biosynthesis | 27 | 5 | 0.0001 | 8.9814 | 0.099
C2 vs C3 | l-Histidine; l-Phenylalanine; l-Arginine; l-Glutamine; Glycine; l-Methionine; l-Lysine; l-Isoleucine; l-Threonine; l-Tryptophan; l-Tyrosine; l-Proline; l-Glutamic acid; Phosphoserine | Aminoacyl-tRNA biosynthesis | 75 | 14 | 0.0002 | 8.758 | 0.1127
C2 vs C3 | Glyoxylic acid; Oxoglutaric acid; N-Formyl-l-methionine; Glycolic acid; Glyceric acid; Pyruvic acid | Glyoxylate and dicarboxylate metabolism | 50 | 6 | 0.0004 | 7.7271 | 0.268
C2 vs C3 | l-Glutamine; Phosphoribosylformylglycineamidine; Cyclic AMP; Adenosine monophosphate; Adenosine; Inosine; Adenine; Hypoxanthine; Guanine; Uric acid; 5-Hydroxyisourate; Guanosine; Adenosine diphosphate ribose; 5-Aminoimidazole ribonucleotide; Glyoxylic acid; Glycine; Adenosine 3′,5′-diphosphate | Purine metabolism | 92 | 17 | 0.0007 | 7.306 | 0.2048
C2 vs C3 | Malonic acid; Beta-Alanine; Spermine; Spermidine; Dihydrouracil; Pantothenic acid; Uracil; l-Histidine | beta-Alanine metabolism | 28 | 8 | 0.0012 | 6.7568 | 0.3577
C2 vs C3 | Uridine 5′-monophosphate; l-Glutamine; Dihydrouracil; Cytidine monophosphate; Cytidine; Cytosine; Uracil; Dihydrothymine; Uridine diphosphate glucose; Malonic acid; Ureidosuccinic acid; Beta-Alanine; Methylmalonic acid | Pyrimidine metabolism | 60 | 13 | 0.0014 | 6.5817 | 0.2756
C2 vs C3 | Pantothenic acid; Dihydrouracil; Beta-Alanine; Pyruvic acid; Adenosine 3′,5′-diphosphate; Uracil | Pantothenate and CoA biosynthesis | 27 | 6 | 0.0023 | 6.0879 | 0.2736
C2 vs C3 | l-Phenylalanine; Phenylpyruvic acid; Benzoic acid; Hippuric acid; Pyruvic acid; l-Tyrosine | Phenylalanine metabolism | 45 | 6 | 0.0072 | 4.9364 | 0.2468
C2 vs C3 | l-Glutamic acid; l-Glutamine; Oxoglutaric acid | D-Glutamine and D-glutamate metabolism | 11 | 3 | 0.0124 | 4.39 | 0.139
C2 vs C3 | l-Glutamine; Ornithine; Citrulline; l-Arginine; l-Glutamic acid; N-Acetylornithine; l-Proline; Hydroxyproline; Guanidoacetic acid; Creatine; Creatinine; 4-Guanidinobutanoic acid; N2-Succinyl-l-ornithine; Putrescine; Spermidine; N-Acetylputrescine; Pyruvic acid; Glyoxylic acid; Spermine | Arginine and proline metabolism | 77 | 19 | 0.0169 | 4.082 | 0.6514
C2 vs C3 | 2-Hydroxyethanesulfonate; Pyruvic acid; 3-Sulfinoalanine | Taurine and hypotaurine metabolism | 20 | 3 | 0.0215 | 3.8411 | 0.0324
C2 vs C3 | N-Acetyl-l-aspartic acid; Pyruvic acid; Ureidosuccinic acid; Oxoglutaric acid; l-Glutamine; l-Glutamic acid; 2-Keto-glutaramic acid | Alanine, aspartate and glutamate metabolism | 24 | 7 | 0.0221 | 3.8108 | 0.4122
C2 vs C3 | Pyridoxamine; Oxoglutaric acid; 3-Hydroxy-2-methylpyridine-4,5-dicarboxylate; Pyruvic acid | Vitamin B6 metabolism | 32 | 4 | 0.0267 | 3.6235 | 0.0773
C2 vs C3 | Oxoglutaric acid; Oxalosuccinic acid; Pyruvic acid | Citrate cycle (TCA cycle) | 20 | 3 | 0.0302 | 3.5015 | 0.176
C2 vs C3 | Glyceric acid; Betaine; Guanidoacetic acid; Dimethylglycine; Glycine; Phosphoserine; l-Threonine; O-Phosphohomoserine; l-Aspartyl-4-phosphate; Creatine; Glyoxylic acid; l-Tryptophan | Glycine, serine and threonine metabolism | 48 | 13 | 0.0372 | 3.2914 | 0.4699
C2 vs C3 | Uridine diphosphate glucose; Glycerol 3-phosphate; Glycerol; Glyceric acid; Galactosylglycerol | Glycerolipid metabolism | 32 | 5 | 0.0427 | 3.1546 | 0.2162
C2 vs C3 | D-Xylose; Uridine diphosphate glucose; D-Glucuronic acid 1-phosphate; Pyruvic acid | Pentose and glucuronate interconversions | 53 | 4 | 0.0427 | 3.1536 | 0.0394



Sparse K-means method
Clusters comparison | Interaction metabolite | Pathway name | Total Cmpd | Match Status | Raw p | -log(p) | Impact
C1 vs C2 | l-Methionine; Glutathione | Cysteine and methionine metabolism | 56 | 2 | 0.007 | 4.9 | 0.0454
C1 vs C3 | l-Methionine; Glutathione | Cysteine and methionine metabolism | 56 | 2 | 0.0020 | 6.2 | 0.00454



Spectral clustering method
Clusters comparison | Interaction metabolite | Pathway name | Total Cmpd | Match Status | Raw p | -log(p) | Impact
C1 vs C3 | Iminoaspartic acid; Quinolinic acid; Niacinamide; Pyruvic acid; Propionic acid | Nicotinate and nicotinamide metabolism | 44 | 5 | 0.0024 | 6.0206 | 0.0712
C1 vs C3 | Glyceric acid; Betaine; Guanidoacetic acid; Dimethylglycine; Glycine; Phosphoserine; l-Threonine; O-Phosphohomoserine; l-Aspartyl-4-phosphate; Creatine; Glyoxylic acid; l-Tryptophan | Glycine, serine and threonine metabolism | 48 | 13 | 0.0040 | 5.5100 | 0.4699
C1 vs C3 | 5′-Methylthioadenosine; N-Formyl-l-methionine; l-Homocysteine; l-Methionine; Glutathione; Phosphoserine; 3-Sulfinoalanine; l-Aspartyl-4-phosphate; Pyruvic acid | Cysteine and methionine metabolism | 56 | 9 | 0.0098 | 4.6232 | 0.2509
C1 vs C3 | Formiminoglutamic acid; l-Glutamic acid; Urocanic acid; l-Histidine; Histamine; D-Erythro-imidazole-glycerol-phosphate; Ergothioneine; Hydantoin-5-propionic acid; Imidazole acetol-phosphate; Oxoglutaric acid | Histidine metabolism | 44 | 10 | 0.0101 | 4.5961 | 0.3705
C1 vs C3 | Oxoglutaric acid; Oxalosuccinic acid; Pyruvic acid | Citrate cycle (TCA cycle) | 20 | 3 | 0.0171 | 4.0710 | 0.1760
C1 vs C3 | Pyruvic acid; l-Threonine; l-Isoleucine | Valine, leucine and isoleucine biosynthesis | 27 | 3 | 0.0178 | 4.0277 | 0.0350
C1 vs C3 | D-Xylose; Uridine diphosphate glucose; D-Glucuronic acid 1-phosphate; Pyruvic acid | Pentose and glucuronate interconversions | 53 | 4 | 0.0210 | 3.8609 | 0.0394
C1 vs C3 | D-Glucose; Glyceric acid; Pyruvic acid | Pentose phosphate pathway | 32 | 3 | 0.0232 | 3.7622 | 0.0218
C1 vs C3 | Pyruvic acid; l-Lactic acid; D-Glucose | Glycolysis or Gluconeogenesis | 31 | 3 | 0.0249 | 3.6928 | 0.0953
C1 vs C3 | Pyruvic acid; l-Lactic acid | Pyruvate metabolism | 32 | 2 | 0.0274 | 3.5955 | 0.3201
C1 vs C3 | l-Glutamic acid; Pyruvic acid; Butyric acid; Oxoglutaric acid | Butanoate metabolism | 40 | 4 | 0.0283 | 3.5644 | 0.0852
C1 vs C3 | 2-Hydroxyethanesulfonate; Pyruvic acid; 3-Sulfinoalanine | Taurine and hypotaurine metabolism | 20 | 3 | 0.0287 | 3.5525 | 0.0324
C1 vs C3 | Glyoxylic acid; Oxoglutaric acid; N-Formyl-l-methionine; Glycolic acid; Glyceric acid; Pyruvic acid | Glyoxylate and dicarboxylate metabolism | 50 | 6 | 0.0303 | 3.4966 | 0.2680
C1 vs C3 | Ascorbic acid; Uridine diphosphate glucose; Pyruvic acid; D-Glucuronic acid 1-phosphate; Oxoglutaric acid | Ascorbate and aldarate metabolism | 45 | 5 | 0.0330 | 3.4104 | 0.1383
C1 vs C3 | Epinephrine; Dopamine; l-Tyrosine; Homovanillic acid; Pyruvic acid | Tyrosine metabolism | 76 | 5 | 0.0385 | 3.2580 | 0.1750
C1 vs C3 | N-Acetyl-l-aspartic acid; Pyruvic acid; Ureidosuccinic acid; Oxoglutaric acid; l-Glutamine; l-Glutamic acid; 2-Keto-glutaramic acid | Alanine, aspartate and glutamate metabolism | 24 | 7 | 0.0390 | 3.2431 | 0.4122
C1 vs C3 | Pyridoxamine; Oxoglutaric acid; 3-Hydroxy-2-methylpyridine-4,5-dicarboxylate; Pyruvic acid | Vitamin B6 metabolism | 32 | 4 | 0.0447 | 3.1074 | 0.0773



PCA K-means method
Clusters comparison | Interaction metabolite | Pathway name | Total Cmpd | Match Status | Raw p | -log(p) | Impact
C1 vs C3 | Iminoaspartic acid; Quinolinic acid; Niacinamide; Pyruvic acid; Propionic acid | Nicotinate and nicotinamide metabolism | 44 | 5 | 0.003 | 5.9412 | 0.0712
C1 vs C3 | Oxoglutaric acid; Oxalosuccinic acid; Pyruvic acid | Citrate cycle (TCA cycle) | 20 | 3 | 0.011 | 4.4865 | 0.1760
C1 vs C3 | Epinephrine; Dopamine; l-Tyrosine; Homovanillic acid; Pyruvic acid | Tyrosine metabolism | 76 | 5 | 0.024 | 3.7311 | 0.1750
C1 vs C3 | Pyruvic acid; l-Lactic acid | Pyruvate metabolism | 32 | 2 | 0.043 | 3.1507 | 0.3201
C1 vs C3 | D-Xylose; Uridine diphosphate glucose; D-Glucuronic acid 1-phosphate; Pyruvic acid | Pentose and glucuronate interconversions | 53 | 4 | 0.044 | 3.1214 | 0.0394
C1 vs C3 | Pyruvic acid; l-Threonine; l-Isoleucine | Valine, leucine and isoleucine biosynthesis | 27 | 3 | 0.045 | 3.1107 | 0.0350
C1 vs C3 | Ascorbic acid; Uridine diphosphate glucose; Pyruvic acid; D-Glucuronic acid 1-phosphate; Oxoglutaric acid | Ascorbate and aldarate metabolism | 45 | 5 | 0.045 | 3.0926 | 0.1383
C1 vs C3 | l-Glutamic acid; Pyruvic acid; Butyric acid; Oxoglutaric acid | Butanoate metabolism | 40 | 4 | 0.046 | 3.0843 | 0.0852
C1 vs C3 | D-Glucose; Glyceric acid; Pyruvic acid | Pentose phosphate pathway | 32 | 3 | 0.046 | 3.0769 | 0.0218
C1 vs C3 | N-Acetyl-l-aspartic acid; Pyruvic acid; Ureidosuccinic acid; Oxoglutaric acid; l-Glutamine; l-Glutamic acid; 2-Keto-glutaramic acid | Alanine, aspartate and glutamate metabolism | 24 | 7 | 0.048 | 3.0446 | 0.4122

(a) Total cmpd is the total number of compounds in the pathway.

(b) Hits (the "Match Status" column) is the actual matched number from the uploaded data.

(c) Raw p is the original p-value calculated from the pathway analysis.

(d) Impact is the pathway impact value calculated from pathway topology analysis.
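The "Raw p" and "-log(p)" columns of Table 5 carry the same information. The tabulated values are consistent with a natural logarithm, which is our reading of the numbers rather than something stated in the text. A minimal sketch checking this on four rows copied from the table (the helper name `raw_p_from_neg_log` is hypothetical):

```python
import math


def raw_p_from_neg_log(neg_log_p, decimals=4):
    """Recover the raw p-value from -ln(p), rounded as in the table."""
    return round(math.exp(-neg_log_p), decimals)


# (pathway, tabulated raw p, tabulated -log(p)) copied from Table 5.
rows = [
    ("Starch and sucrose metabolism", 0.0107, 4.5388),
    ("Glycerolipid metabolism", 0.0153, 4.1831),
    ("Tryptophan metabolism", 0.0001, 9.1233),
    ("Glutathione metabolism", 0.0, 12.826),
]

for name, raw_p, neg_log_p in rows:
    recovered = raw_p_from_neg_log(neg_log_p)
    print(f"{name}: tabulated p = {raw_p}, exp(-{neg_log_p}) = {recovered}")
```

This also explains the raw p-values reported as 0: they are small values rounded to four decimals, and the -log(p) column preserves their ordering.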

Finally, with 30 identified pathways, SIMLR is the method that identified the most metabolic pathways. Of these, the 3 most important are “arginine and proline metabolism”, “glycine, serine and threonine metabolism” and “alanine, aspartate and glutamate metabolism”, all involved in amino acid metabolism. The Venn diagram (Fig. 4) shows the overlap of pathways detected by the five methods. Amino acid metabolism appeared to be the most frequently modified pathway. Enrichment and pathway analyses also showed modifications in glucose metabolism. From a biological point of view, SIMLR and spectral clustering are the two methods that identified the most relevant metabolic pathways.
Fig. 4

Venn diagram of pathways that were in common or unique to the five clustering methods.


Comparison of intensity of metabolites between the 5 methods

Within amino acid and glucose metabolism, fourteen related metabolites were selected as potential biomarkers in BC [54], [55], [56], [57]. As shown in Supplementary Fig. 4, the intensities of these 14 metabolites were compared between the 3 clusters for each of the 5 methods. The intensities of Uridine diphosphate (UDP) glucose, Guanine, l-Glutamine, l-Glutamic acid, l-Isoleucine, l-Proline, l-Methionine, l-Phenylalanine, Pyruvic acid, Spermine, Glutathione, Creatine, l-Carnitine and l-Acetylcarnitine differed significantly for at least one comparison between clusters. The five methods agree that cluster 3 patients have low levels of Creatine, l-Acetylcarnitine and l-Glutamic acid and high levels of Guanine, l-Isoleucine, l-Phenylalanine, Pyruvic acid and Spermine (Fig. 5). These metabolite levels seem to be predictive of poor prognosis [57], [58], [59].
Fig. 5

Boxplot of the 8 metabolites extracted from 5 ML methods.


Discussion

From a machine learning perspective

To the best of our knowledge, this proof-of-concept study is the first to compare different unsupervised ML methods to identify metabolomics-based prognostic signatures in BC. Analyses were intentionally performed without any prior clinical or biological assumptions. Clinical and biological interpretations were performed only after cluster identification. The objective of our study was to compare different unsupervised ML algorithms for feature selection from untargeted metabolomic data and to evaluate the capacity of these methods to select relevant features for further use in prediction models. This study did not seek to highlight significant differences but rather to assess how unsupervised methods behave with high-dimensional metabolomic data and to open up new perspectives in the particularly active domain of BC phenotype predictors. We demonstrated that the K-sparse and SIMLR methods achieve higher clustering performance than the three other popular unsupervised ML methods in detecting groups of patients with BC using metabolomic data. Interestingly, even though the spectral method is slightly less clinically effective than the K-sparse and SIMLR methods, it identified relevant metabolic pathways. Our study has several limitations, namely the relatively small number of patients and the monocentric and retrospective nature of the study. In addition, our results could not be validated on an external cohort. Clustering performance was assessed only by internal validation based on the silhouette value. Indeed, since the true labels were unknown, we could not compare the labels obtained from our classification with them to calculate classification accuracy. Other unsupervised ML methods such as model-based clustering, bi-clustering and deep learning may be of value in this analysis and should be further explored.
Yet it is worth noting that, even though deep learning methods are of particular interest in many fields, they require a very large number of patients to be trained efficiently and may therefore not be suitable for small metabolomics datasets obtained from real-life patients, such as the one we used. While obtaining imaging or clinical data for several thousand patients seems achievable, obtaining metabolomics data for that many patients is currently much more complicated. Furthermore, even though some efforts are being made to tackle this issue [60], it is currently impossible to understand which features are responsible for the outcome when using deep-learning clustering techniques. It would therefore be impossible to understand the metabolic differences underlying different patient clusters if deep-learning clustering were used. These considerations raise important questions: in the future, on what basis should decisions be made? On results from a single method? Or on results provided by several methods? In view of our findings, it seems that decisions should be taken collegially, i.e. based on the results of a set of methods, as at multidisciplinary consultation meetings, which involve health professionals from different disciplines whose skills are essential for taking decisions that ensure patients the best possible care according to the state of the science.
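For readers unfamiliar with the silhouette value used here for internal validation, the following sketch computes it from scratch on toy two-dimensional points. It uses the standard definition s(i) = (b − a) / max(a, b), where a is the mean distance from sample i to the other members of its cluster and b is the smallest mean distance to the members of any other cluster. The Euclidean distance, the function names and the toy data are assumptions made for illustration; this is not the analysis pipeline of the study.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every sample (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: three well-separated groups of two samples each.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels = [1, 1, 2, 2, 3, 3]
values = silhouette_values(points, labels)
mean_silhouette = sum(values) / len(values)
print(f"mean silhouette = {mean_silhouette:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate samples lying between clusters. Because no true labels are involved, the measure can only support internal validation, as noted above.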

From a clinical perspective

From a clinical point of view, the methods were able to highlight three distinct groups of patients with different clinical profiles. Patients in cluster 1 may be considered to have the best prognosis, patients in cluster 2 an intermediate prognosis, and patients in cluster 3 the worst prognosis. The results in Table 2 show that the tumors of patients in cluster 1 were predominantly non-invasive and non-proliferative, whereas the tumors of cluster 3 patients were mainly invasive and proliferative. Tumors in cluster 2 were rather invasive but not proliferative, hence the intermediate prognosis. We hypothesize that these patients have an intermediate (atypical) biological profile, which is why the methods are discordant. We also found evidence of heterogeneity within the triple-negative BC subpopulation: most of these patients were classified in cluster 3, but a third were in cluster 1. Recent molecular profiling studies of triple-negative BC using parallel sequencing and other “omics” technologies have also uncovered an unexpectedly high level of heterogeneity as well as a number of common features [61], [62]. In addition, no significant difference between clusters could be demonstrated in terms of age, histologic type, lymph node involvement, metastasis or survival (OS, SS or RFS). Indeed, a median follow-up of only 48.5 months is insufficient to demonstrate a significant difference in terms of OS, SS, or RFS. Nevertheless, it is reasonable to expect that patients in cluster 3 have the highest risk of progression and that, conversely, patients in cluster 1 have the lowest risk of progression. To test this intuition and mitigate the limitation of short follow-up, we analyzed simulated survival data obtained with the PREDICT tool.
With a 5-year pOS rate of around 75% for cluster 1, 70% for cluster 2 and 60% for cluster 3, in-silico analyses have demonstrated their high potential value [28], [63], [64] and confirmed that patients in cluster 3 have a poorer prognosis [65], [66]. One limitation of our study could be the representativeness of our population; for example, it is recognized that BCs in younger patients (<40 years) are more aggressive [67]. Our study did not include a large number of young patients, which could explain why no significant difference was demonstrated in terms of age between clusters. Similarly, with only three patients with invasive lobular carcinoma (6%), our results did not identify a metabolic signature associated with this phenotype. Previous studies have shown a survival benefit in favor of invasive lobular carcinoma [68], [69], and metabolomic studies focused on this particular type of BC could provide valuable biological information. Furthermore, due to the over-representation of hormone-receptor-negative tumors (48%) in our population compared to the literature [70], our population could have had an unfavorable prognosis. This bias may result from our method of tumor selection. We decided to analyze frozen samples available in our biobank, and hormone-receptor-negative, triple-negative and HER2-positive tumors are more often frozen and stored for further molecular testing and inclusion in clinical trials. In the present study, it is interesting to note that the five methods classified 73% of the patients in the same cluster. Among the 27% of patients classified differently by at least one of the methods, 9.5% of patients were classified heterogeneously by the five methods. Indeed, for each of these 5 patients, three methods assigned the patient to one cluster and the other two methods to another cluster, without any connection between the types of methods used.
Moreover, it is interesting to note that the different methods placed these patients either in the good or the intermediate prognostic cluster, or in the intermediate or the poor prognostic cluster, but never in both the good and the poor prognostic clusters. A clinical analysis of these 5 patients showed that they had atypical clinical profiles, probably due to particular biological profiles. These atypical profiles would explain why no classification consensus could be reached. Overall, ML methods must remain a decision-making tool for the clinician, especially in cases where patients have particular clinical and biological characteristics. To avoid possible medical errors, the final responsibility for the decision lies with the clinician [71]. Finally, the initial clinical objective of this study was to define a metabolomic signature to refine the current classification and help clinicians with chemotherapy prescription. This paper is the result of methodological research analyzing the best ML methods to develop this new tool. The patients selected were therefore patients eligible for adjuvant chemotherapy. An analysis of the metastatic population could help define a specific signature of metastatic status and/or a signature associated with survival. However, the use of biopsies faces two practical difficulties: 1) intratumoral and inter-site heterogeneity, which could be overcome through the analysis of blood or urine samples; and 2) the amount of material available once the pathologic analyses essential for patient management have been performed. Metabolomic analysis on paraffin slides could facilitate access to specimens and limit the amount of material required.
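The agreement analysis discussed in this section (unanimous versus split assignments, and splits only between adjacent prognostic clusters) can be sketched as follows. This is an illustration under assumptions: cluster labels are taken to be already aligned across methods by prognosis (1 = favorable, 2 = intermediate, 3 = unfavorable), the function name `summarize_agreement` is hypothetical, and the four toy patients are invented rather than taken from the cohort.

```python
from collections import Counter


def summarize_agreement(assignments):
    """Summarize, per patient, how the methods voted.

    assignments maps a patient id to the list of cluster labels (1, 2 or 3)
    given by each method, assumed to be aligned across methods.
    """
    summary = {}
    for patient, labels in assignments.items():
        label, votes = Counter(labels).most_common(1)[0]
        summary[patient] = {
            "majority": label,
            "votes": votes,
            "unanimous": votes == len(labels),
            # True when the methods disagree only between neighbouring clusters.
            "adjacent_only": max(labels) - min(labels) <= 1,
        }
    return summary


# Invented toy assignments for five methods (not study data).
toy_assignments = {
    "P1": [1, 1, 1, 1, 1],
    "P2": [1, 1, 1, 2, 2],
    "P3": [3, 3, 2, 2, 3],
    "P4": [3, 3, 3, 3, 3],
}

summary = summarize_agreement(toy_assignments)
unanimous_rate = sum(s["unanimous"] for s in summary.values()) / len(summary)
print(f"patients classified identically by all methods: {unanimous_rate:.0%}")
```

A majority vote of this kind is one simple way to read the suggestion that decisions be based on a set of methods. Patients with a 3-versus-2 split would then be the natural candidates for closer clinical review.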

From a biological perspective

From a physiological point of view, this study extends the molecular stratification of BC to metabolomic profiles. Indeed, our results suggest that dysregulation of metabolic pathways exists between BC subtypes and that a particular amino acid profile characterizes the different BC histologic subtypes. Dysregulations of amino acid metabolism are well-known key events during cancer development [72] and are emerging hallmarks of cancers [73], [74]. Amino acids serve not only as building blocks in protein synthesis but also as energy sources favoring cancer cell proliferation and growth [75]. Of interest, we identified significant differences between the BC subtypes in three metabolic pathways (i.e. glycolysis and lactate production, glutaminolysis, and amino acid metabolism) that play a pivotal role in BC growth [76], [77]. Using the five methods, we consistently found that patients in cluster 3 showed higher levels of Guanine, l-Isoleucine, l-Methionine, l-Phenylalanine, Pyruvic acid and Spermine and lower levels of Creatine, l-Acetylcarnitine and l-Glutamic acid. Our results suggest that these metabolites could be candidate biomarkers predictive of poorer prognosis [78], [79], [80], [81], [82]. All these results are consistent with the literature [57], [83], [84], [85], [86]. Given the exploratory nature of our study, we decided to use an FDR of 0.25 as a threshold in order to identify relevant candidate pathways (https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ). These pathways will need to be validated, using a lower False Discovery Rate or Family-Wise Error Rate (<0.05), in a study whose main objective is to evaluate the usefulness of our metabolomic signatures for decision-making. Indeed, to meet the biosynthetic needs associated with rapid proliferation, cancer cells must increase the import of nutrients.
Two main metabolites are essential for biosynthesis and survival in mammalian cells, and particularly in cancer cells: glucose [87] and glutamine [88]. The increased glucose uptake in tumors compared to healthy, non-proliferative tissues was first described more than 90 years ago by Otto Warburg [89]. Glucose is the primary energy source of all cells because of its involvement in many processes such as glycolysis or the Krebs cycle [90] in mitochondria. Unlike healthy cells that adapt to available substrates (glucose/fatty acids/proteins), some tumor cells are addicted to glucose. The other important point is that, once glucose has been metabolized, tumor cells preferentially use lactic fermentation rather than the Krebs cycle.

Lastly, the precise etiology of BC is still unknown even though some genetic, epigenetic and environmental factors have been identified [91]. It has been conclusively demonstrated that cancer cell metabolism is heavily influenced by microenvironmental factors, including nutrient availability. Sullivan and coworkers [92] found that diet affects local nutrient availability. This effect can lead to substantial changes in the metabolism of tumor cells, thereby modifying the response of these cells to drugs targeting metabolism. Drugs capable of inhibiting tumor proliferation may then become ineffective. Therefore, knowledge of microenvironmental nutrient levels is essential to a better understanding of tumor metabolism.

Outcomes for cancer patients vary greatly. The classification of BC into subtypes has been defined in the literature on the basis of molecular characterization of proteomics (single omic). This has helped improve prognosis and personalized treatment. These considerations have motivated efforts to produce large amounts of multi-omic data such as TCGA [93] and ICGC [94]. However, current algorithms still face challenges and need to integrate omic data [95], [96], [97], [98].
Defining BC subtypes using multi-omic data could help clarify some of the remaining unknowns about tumor mechanisms and thereby make it possible to offer even more personalized treatments.
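To illustrate the difference between the exploratory FDR threshold of 0.25 used for the pathway analysis and the stricter 0.05 level planned for validation, the following minimal Python sketch applies both cut-offs to the same set of adjusted values. It rests on several assumptions: the pathway names and p-values are hypothetical and are not results of this study, and the Benjamini-Hochberg procedure is used here as a simple stand-in, whereas GSEA derives its own FDR q-values from permutations.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical pathway-level p-values (illustrative only, not study data).
pathway_p = {
    "pathway_A": 0.001,
    "pathway_B": 0.012,
    "pathway_C": 0.040,
    "pathway_D": 0.150,
    "pathway_E": 0.600,
}

names = list(pathway_p)
q = dict(zip(names, benjamini_hochberg([pathway_p[n] for n in names])))

# Exploratory screen (FDR < 0.25) versus confirmatory level (FDR < 0.05).
exploratory = [n for n in names if q[n] < 0.25]
confirmatory = [n for n in names if q[n] < 0.05]

print("FDR < 0.25:", exploratory)
print("FDR < 0.05:", confirmatory)
```

In this toy example, four of the five hypothetical pathways pass the exploratory threshold but only two remain at the 0.05 level, which is why candidates retained at 0.25 are best regarded as hypotheses to be confirmed in a dedicated study.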

Conclusion

In the era of personalized medicine, the omics sciences (genomics, transcriptomics, proteomics, and metabolomics) must contribute to the quest for cancer-specific biomarkers. The present study argues in favor of further research in this domain. Metabolomics is emerging as a relevant and promising tool for the classification of BC to enable more precise diagnosis [54], [99], [100], [101]. Even though it is less accurate than the targeted approach, untargeted metabolomics nevertheless permits identification and quantification of a vast number of major metabolites. Thus, this approach is of particular interest in the search for new candidate biomarkers [102], [103], [104] and could be applied in everyday medical practice given that the cost and duration of metabolomic analyses are relatively low. However, due to the retrospective design of our study and the small number of patients recruited, our results need to be validated in a larger cohort and in the context of a prospective clinical trial.

Funding

The authors declare no competing financial interests.

CRediT authorship contribution statement

Jocelyn Gal: Methodology, Formal analysis, Writing - original draft. Caroline Bailleux: Writing - original draft. David Chardin: Software, Writing - original draft. Thierry Pourcher: Conceptualization, Writing - review & editing. Julia Gilhodes: . Lun Jing: . Jean-Marie Guigonis: Methodology, Writing - review & editing. Jean-Marc Ferrero: Data curation. Gerard Milano: Writing - review & editing. Baharia Mograbi: Writing - review & editing. Patrick Brest: Writing - review & editing. Yann Chateau: . Olivier Humbert: Conceptualization, Writing - review & editing. Emmanuel Chamorey: Supervision, Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.