Literature DB >> 26355846

Clustering of Expression Data in Chronic Lymphocytic Leukemia Reveals New Molecular Subdivisions.

Sally Yepes1, Maria Mercedes Torres1, Rafael E Andrade2.   

Abstract

Although the identification of inherent structure in chronic lymphocytic leukemia (CLL) gene expression data using class discovery approaches has not been extensively explored, the natural clustering of patient samples can reveal molecular subdivisions that have biological and clinical implications. To explore this, we preprocessed raw gene expression data from two published studies, combined the data to increase the statistical power, and performed unsupervised clustering analysis. The clustering analysis was replicated in 4 independent cohorts. To assess the biological significance of the resultant clusters, we evaluated their prognostic value and identified cluster-specific markers. The clustering analysis revealed two robust and stable subgroups of CLL patients in the pooled dataset. The subgroups were confirmed by different methodological approaches (non-negative matrix factorization NMF clustering and hierarchical clustering) and validated in different cohorts. The subdivisions were related with differential clinical outcomes and markers associated with the microenvironment and the MAPK and BCR signaling pathways. It was also found that the cluster markers were independent of the immunoglobulin heavy chain variable (IGVH) genes mutational status. These findings suggest that the microenvironment can influence the clinical behavior of CLL, contributing to prognostic differences. The workflow followed here provides a new perspective on differences in prognosis and highlights new markers that should be explored in this context.

Entities:  

Mesh:

Substances:

Year:  2015        PMID: 26355846      PMCID: PMC4565688          DOI: 10.1371/journal.pone.0137132

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Chronic lymphocytic leukemia (CLL) is one the most frequently occurring leukemias in adults in Western countries and is characterized by mature B cell accumulation in the blood, bone marrow and secondary lymphoid organs. CLL patients can be divided into two major groups based on whether their immunoglobulin heavy chain variable region (IGVH) genes are mutated or unmutated. Patients with an unmutated IGVH gene have a less favorable prognosis than patients with a mutated IGVH gene [1, 2]. Different chromosomal aberrations, such as deletions in 11q, 13q, or 17p and trisomy 12, have also been found in CLL patients, with varied prognostic implications [3]. Common genetic causes have not yet been identified [4], but recurrent mutations in TP53 and ATM and new mutations in NOTCH1, SF3B1, MYD88, BIRC3 and FBXW7 have been identified in recent years by next-generation sequencing [5]. Little research has been performed to examine the natural clustering of CLL patient samples or to identify subtypes based on gene expression patterns, partly because expression studies in CLL patients have focused on the analysis and comparison of established disease subtypes. However, the identification of CLL patient groups is a current research goal, the realization of which could contribute to the identification of different prognostic subtypes and help to explain the heterogeneity in the clinical behavior of the disease. The main purpose of this study was to assess the possibility of detecting molecular subtypes of CLL patients based on gene expression microarrays in a relatively large group of samples obtained by merging expression readouts. If so, the goal was to confirm subdivisions in different cohorts, identify markers in the detected subgroups and explore the clinical and biological implications. We followed the methodological workflow presented in Fig 1. Briefly, microarray datasets from two different CLL expression studies were individually preprocessed, merged and corrected for non-biological variation. The resulting pooled data were used to identify stable clusters, or subgroups of patients with similar gene expression patterns. To this end, we applied different unsupervised clustering methods to confirm the structure in the data (non-negative matrix factorization NMF clustering, hierarchical clustering and multidimensional scaling). Cluster analysis was performed in 4 other independent cohorts. To identify cluster-specific genes, we identified genes that were differentially expressed between the clusters using the significance analysis of the microarray SAM method in both the merged data and individual cohorts.
Fig 1

Methodology flow chart.

Panel A shows the methodological steps followed for clustering analysis, and panel B shows the steps for the biological evaluation of the clusters obtained.

Methodology flow chart.

Panel A shows the methodological steps followed for clustering analysis, and panel B shows the steps for the biological evaluation of the clusters obtained. The resulting genes were analyzed in relation with the biology of the disease, pathway enrichment, and predictive role. The survival implication of the clusters and the individual contribution of cluster-specific markers to survival were evaluated. The relationship of the clusters to IGVH mutational status was also analyzed. A detailed explanation of the methodology can be found in the Materials and Methods section.

Materials and Methods

A schematic description of the workflow is presented in Fig 1. Panel A shows the methodological steps followed for clustering analysis, and panel B shows the steps for the biological evaluation of the clusters.

Clustering Analysis

Dataset and Array Preprocessing

The present study used microarray expression profiles of CLL obtained from the Gene Expression Omnibus (GEO) database of the National Center for Biotechnology Information (NCBI). GSE39671: (n = 130) [6] and GSE22762: (n = 107) [7] (both analyzed with the Affymetrix Human Genome U133 Plus 2.0 Array) were chosen for clustering analysis because they contain data about time to treatment (TTT) and overall survival (OS), respectively, and have suitable, comparable and large numbers of samples. They were independently preprocessed before combining them in one dataset for clustering and subsequent analysis. Another 4 cohorts that were analyzed using different microarray platforms were also chosen and preprocessed for clustering analysis and to explore the relationship with IGVH mutational status. These cohorts correspond to the following: GSE46261: (n = 211), Affymetrix Human Gene 1.0 ST Array [8], GSE9992: (n = 60), Affymetrix Human Genome U133A Array [9], GSE2466: (n = 100), Affymetrix Human Genome U95A Array [10] and GSE38611: (n = 136) Affymetrix Human Gene 1.0 ST Array [11]. Once the studies were selected, raw gene expression data from each study were independently preprocessed; this process comprised 3 steps: 1) background correction to adjust the intensity readings for nonspecific signals; 2) adjustment of the intensity readings for technical variability to ensure that the measurements of all of the samples were comparable (normalization); and 3) computation of a summary value for the different probes representing each gene (summarization). Each probe was also linked to its corresponding gene name (annotation), and non-relevant information was removed (filtering) [12]. Individual cohort preprocessing was performed using the RMA algorithm, a method that encompasses all 3 preprocessing steps [13]. Entrez gene IDs for Affymetrix probes were obtained from the appropriate annotation package for each microarray platform. Gene filtering removed 10% of the unexpressed and non-informative genes [14]. All of the analyses were performed using the appropriate package in R [15]. Specifically, the ‘affy’ package was used for microarray reading and for the initial preprocessing steps [16]. Gene annotation was performed using the annotation package [17]. Quality control was performed with affyQCReport [18], and the filtering procedure was performed with MetaDE software [19].

Dataset Pooling

Combining data from different studies can be beneficial for uncovering underlying biological insights that are not easily identified in few cases and can increase the statistical power of the study. However, because non-biological experimental variation or “batch effects” are observed across independent experiments, after merging cohorts, it is necessary to correct for systematic variation without compromising the structure of the data or the biological information contained within the data. Here, the cohorts GSE39671 and GSE22762 were merged and corrected for non-biological variation using The COMBAT method (empirical Bayes) implemented in the inSilicoMerging package [20].

Consensus-based Non-negative Matrix Factorization (NMF)

To predict stable clusters in the merged data, NMF was applied, which detects context-dependent patterns in gene expression data rather than dividing clusters based on distance computation. NMF is based on the decomposition of data into parts and can reduce the dimensionality of an expression set from thousands of genes to several metagenes. Each metagene is defined as a positive linear combination of genes in the expression data. NMF then groups the samples into clusters based on the gene expression pattern of the samples as positive linear combinations of these metagenes. NMF Consensus repeatedly runs the clustering algorithm against perturbations of the gene expression data and creates a consensus matrix to assess the stability of the resulting clusters [21]. Let X be an nxp non-negative matrix and r>0 be an integer. Non-negative matrix factorization consists of finding an approximation where W and H are nxp and rxp non-negative matrices, respectively. Because the objective is to reduce the dimensionality of the original data, the factorization rank r is often used, such that r << min(n, p). The objective behind this choice is to summarize and split the information contained in X into r factors: the columns of W. The main approach to NMF is to estimate the matrices W and H as a local minimum: where D is a loss function that measures the quality of the approximation. Common loss functions are based on the Frobenius norm or the Kullback-Leibler divergence. The NMF algorithm was applied in GenePattern software [22].

Cluster Number Selection and Outlier Removal

Selection of the number of classes or clusters was performed using the quantitative Cophenetic coefficient defined in Brunet et al [21]. The Cophenetic coefficient computes a score of global clustering robustness across the consensus matrix. The number of clusters was also confirmed for inspection of the graphical representation of the consensus matrix. Even though clustering methodologies using the consensus process can detect robust groups, the identification of cluster-associated genes can be influenced by unusual samples. To minimize the impact of outliers on cluster marker identification, samples with negative silhouette widths were excluded, and only samples that were significantly associated with the center of each cluster were included; this was performed using the cluster package [23].

Hierarchical Clustering and Multidimensional Scaling

To corroborate the subgroup structure in the data, in addition to the NMF method, we also applied different methodological approaches such as hierarchical clustering and multidimensional scaling. Preprocessed expression arrays were subjected to hierarchical clustering using the Ward method and the distance 1-r, where r is the Pearson correlation coefficient. Multidimensional scaling was applied to visualize subdivisions in the merged data and to evaluate the distance used for the hierarchical clustering. The analysis was performed using the cluster package [23].

Biological Evaluation of Clusters

Cluster Markers

To identify cluster-specific genes, we identified genes that were differentially expressed between clusters using significance analysis of microarray (SAM) [24], allowing the identification of up-regulated and down-regulated genes in each cluster. This method assesses differential gene expression relative to the spread of expression across all genes. The false discovery rate (FDR) was set to 0. The analysis was performed using the siggenes package [25]. The markers obtained from SAM, using the merged data, were analyzed and used to predict the more discriminatory genes for each cluster. To accomplish this, we used Prediction Analysis for Microarrays (PAM), in which the nearest shrunken centroid for the data was computed [26]. Leave one out cross validation (LOCV) was applied to cross-validate the classifier produced. The procedures were executed in the pamr package [27].

Functional Enrichment

To identify signaling pathways involved in the differences between clusters, the differentially expressed genes identified with SAM were analyzed for modular enrichment using the Genecodis server [28, 29, 30]. The method obtains co-occurrence annotations in the KEGG and Panther databases, the P values are calculated through hypergeometric analysis corrected by FDR method

Nearest Template Prediction (NTP) and Microenvironment Signature

To associate the class of a given sample (cluster membership) to a CLL microenvironment signature, the nearest template prediction algorithm (NTP) [31] was applied to the merged dataset using GenePattern software [22]. To obtain CLL microenvironment signatures, the original microarray data from Herishanu et al [32] were used. Matched tissue and blood samples that were simultaneously obtained from CLL patients were preprocessed and analyzed to identify genes that were differentially expressed between the lymph nodes (LN) and peripheral blood (PB), and genes that were differentially expressed between bone marrow (BM) and PB. Differential expression was assessed by SAM analysis (>2-fold change, FDR <20%). A B-cell receptor (BCR) signature obtained by Pede et al [33] was used after BCR stimulation of CLL cells for 24 hours, and genes with a fold change >2 were considered.

Survival Analysis

Patients from cohorts GSE39671 and GSE22762 were used to determine whether the obtained clusters were related to survival (TTT and OS). Survival curves were analyzed according to the Kaplan-Meier method and compared using the log-rank test. To evaluate the contribution of individual genes to survival, Cox regressions were applied. The analyses were performed using the survival package [34]. Given the relevance of the IGVH mutational status for the prognosis of CLL, the relationship between clusters and the mutational status was evaluated in the 4 independent cohorts.

Heatmaps

Heatmaps were generated with Gene Pattern software. The genes in the heatmaps were ordered based on their differential expression using a t test [22].

Results

Primary Cluster Identification in CLL

The use of small cohorts can prevent the identification of subgroups that are revealed when a large and heterogeneous group of samples is employed. Therefore, in this paper, we combined information from different and independent expression cohorts to increase the statistical power of the study. We independently preprocessed the expression datasets GSE22762 and GSE39671; both of these cohorts were originally assayed using the same microarray platform (Affymetrix Human Genome U133 Plus 2.0 Array). After preprocessing, we obtained 16,287 genes for each dataset. We merged the above studies and adjusted for non-biological variation, obtaining a list of 15,895 genes in common between the two studies and 237 samples in total. This data matrix is available as We used the NMF clustering method to cluster the described merged data according to gene expression and identify patient subtypes. The NMF analysis defined two distinct high-consensus CLL subgroups (Fig 2), to which we refer as cluster 1 and cluster 2. The subdivision is evident when visualizing the consensus matrix (Fig 2A) and based on the highest value of the global clustering robustness score for k = 2 (Fig 2B). Detailed clustering analysis results are presented in and graphically represented in , in which the identity of the samples, the cohort from which they were derived and their class membership after clustering with NMF are shown.
Fig 2

NMF consensus clusters for the pooled data.

NMF consensus analysis of the merged data revealed a good consensus for k = 2. A. Maximum cophenetic coefficients for k = 2 to 5 clusters and the consensus matrices for k = 2 to 5 are shown. B. Plot showing a comparison of cophenetic coefficients among k clusters. This score provides a summary of global clustering robustness across the consensus matrix, with zero indicating the least robust partition and one indicating the most robust partition. From the perspective of robustness, the maximum peak of the cophenetic coefficient plot determines the optimal number of subgroups in the data. The division in the data is also evident in the consensus matrix, which showed a clear boundary between red and blue areas, indicating robust and stable clustering in comparison with other subdivisions.

NMF consensus clusters for the pooled data.

NMF consensus analysis of the merged data revealed a good consensus for k = 2. A. Maximum cophenetic coefficients for k = 2 to 5 clusters and the consensus matrices for k = 2 to 5 are shown. B. Plot showing a comparison of cophenetic coefficients among k clusters. This score provides a summary of global clustering robustness across the consensus matrix, with zero indicating the least robust partition and one indicating the most robust partition. From the perspective of robustness, the maximum peak of the cophenetic coefficient plot determines the optimal number of subgroups in the data. The division in the data is also evident in the consensus matrix, which showed a clear boundary between red and blue areas, indicating robust and stable clustering in comparison with other subdivisions. To corroborate the partition of the samples into two different subtypes, we applied hierarchical clustering. We found class membership coincidence between the NMF consensus clustering and the hierarchical clustering in 90% of the samples, as most samples belonged to the same clusters in both analyses. This result supports two major subdivisions in the data. Furthermore, hierarchical clustering allowed the two subgroups in the 4 independent cohorts to be individually analyzed (Fig 3A and 3C–3F).
Fig 3

CLL sample clustering showing the primary transcriptional subgroups.

Dendrogram obtained by hierarchical clustering of different cohorts. A. Merged dataset. B. Multidimensional scaling of the merged data; the left panel shows evidence of two clusters, and the right panel shows the two individual cohorts after sample merging. However, some sample overlap between groups was evident. C. Cohort GSE38611. D. Cohort GSE9992. E. Cohort GSE2466. F. Cohort GSE46261.

CLL sample clustering showing the primary transcriptional subgroups.

Dendrogram obtained by hierarchical clustering of different cohorts. A. Merged dataset. B. Multidimensional scaling of the merged data; the left panel shows evidence of two clusters, and the right panel shows the two individual cohorts after sample merging. However, some sample overlap between groups was evident. C. Cohort GSE38611. D. Cohort GSE9992. E. Cohort GSE2466. F. Cohort GSE46261. Multidimensional scaling was useful for evaluating the adequacy of the distance used in the hierarchical clustering and made it possible to visualize two clusters in the data. This analysis also revealed a lack of sample grouping on the basis of sample origin, confirming that the pooled samples had been properly adjusted for batch effects (Fig 3B). In conclusion, the merged expression dataset that represents a large and heterogeneous group of patients with CLL can be naturally divided into two stable and robust transcriptional subgroups. This subdivision is supported by methodological approaches of a different nature (NMF clustering detects context-dependent patterns and hierarchical clustering divides data based on distance computation) and was also confirmed in 4 independent cohorts.

Cluster-specific Marker Identification

We identified markers associated with the CLL subtypes by searching for genes that were differentially expressed (DE) between the two clusters. The SAM analysis identified up to 3379 genes with a statistically significant difference between clusters in the merged dataset (ΔSAM = 4 and FDR = zero). Under a more stringent cut-off (corrected P value = 0 and median fold change >2), which may reflect a more biologically relevant scenario, we identified 230 genes that were differentially expressed between clusters. Some of the most highly up-regulated genes in cluster 2 were FCRLA, HDHD2, TCL1A, TNFRSF17 and SERPINI1; conversely, these genes were down-regulated in cluster 1. The most up-regulated genes in cluster 1 include SERPINB2, DENND4B, C15ORF48, ZNF331 and NR4A2; conversely, their expression was down-regulated in cluster 2. The most highly up-regulated genes in each cluster and their fold changes are presented in Table 1. A complete list of all differentially expressed genes in each cluster as well as their statistical parameters and fold changes can be found in the supplementary information ().
Table 1

Most highly up-regulated genes in each cluster (merged data).

Cluster 2Cluster 1
GeneFold changeGeneFold change
1 FCRLA 4,51 SERPINB2 5,93
2 HDHD2 4,45 DENND4B 5,53
3 TCL1A 4,15 C15ORF48 5,36
4 TNFRSF17 3,93 ZNF331 5,20
5 SERPINI1 3,51 NR4A2 4,31
6 ANXA4 3,28 G0S2 4,28
7 UGDH 3,18 METRNL 3,91
8 GAPT 3,16 SLC7A5 3,60
9 AIM2 3,15 MAFB 3,43
10 CPNE5 3,10 IL1B 3,42
Importantly, the most significantly differentially expressed genes between clusters, obtained either from the merged data or individual cohorts, showed outstanding reproducibility (Table 2). The common genes found to be up-regulated in cluster 2 and conversely down-regulated in cluster 1 include TCL1A, FCRLA, FIG4, AIM2, SELL, RAC2, CD27, SAMD9L.
Table 2

Most highly up-regulated genes in cluster 2 (independent cohorts).

GSE39671GSE22762GSE9992GSE46261GSE24666GSE38611
1 TNFRSF17 FCRLA TCL1A FCRL1 METTL7A FCRLA
2 TCL1A TCL1A SELL FCRL5 TCL1A SAMD9L
3 FCRLA HDHD2 TGFBI FCRLA CD79B FCRL1
4 HDHD2 SERPINI1 AIM2 PDGFD PTPN6 FCRL5
5 DYNLL1 ANXA4 CD79B ZMAT1 SYK TCL1A
6 CDC20 HIBCH FAM65B NDRG3 SKAP2 SLAMF6
7 IRF2 SLC25A43 TRAC SAMD9L CD27 FIG4
8 ZNF559 FIG4 C17ORF62 FCRL2 RAC2 FCRL2
9 HIST1H2AC C17ORF62 P2RY14 NIPAL2 FAM65B NDRG3
10 UGDH CPNE5 PSMB9 CCDC141 ACADM LYST
The proto-oncogene TCL1 is of particular interest due to its crucial role in CLL pathogenesis. A high level of expression of this gene is associated with CLL development [35, 36, 37]. Recently, it has been demonstrated that stromal cells modulate TCL1 expression in CLL and repress important target molecules such as FOS, JUN and members of the AP-1 complex, suggesting that microenvironment-derived signals play an important role in the survival of CLL cells [38]. TNFRSF17 was the first up-regulated gene identified in experiments in which CLL cells were co-cultured with different stromal cells [38]. This gene was also identified as one of the most significantly differentially expressed genes between clusters in the merged data, supporting the influence of stromal cells on cluster 2. FCRL family of proteins showed differentially expression between clusters, these proteins share many similar features with the classical Fc receptors and some members of this family have predictive value for determining clinical progression in CLL [39] Given the highly differential expression of TCL1 between clusters, its repeatable expression pattern in different cohorts, and its role in the microenvironment and CLL progression, we call attention to the biological implication of this gene in cluster subdivision. Given the large number of genes that are differentially expressed between clusters and for the purpose of proposing reliable cluster markers, we employed a prediction method (PAM) to find the most discriminatory genes. From the 230 genes that were differentially expressed between clusters, the method could identify a minimal set of 34 genes capable of predicting, with an overall error rate of less than 5%, the cluster membership. The resulting markers ordered by PAM score and showing the direction of gene expression are listed in Table 3. Based on our analyses, the highest score was assigned to ZNF331 as a predictive marker of clusters 1. ARID5A, C15ORF48, SLC7A5, ELL2, MTMR6, were also assigned to this cluster. HDHD2, UGDH, TNFRSF17, FCRLA C11ORF73, ZNF559, and TCL1A, among other genes, were assigned to cluster 2. Interestingly, one of the most biologically relevant genes in the cluster 2, TCL1A, has roles as proto-oncogene, and the gene with the highest discrimination score in the cluster 1, ZNF331, has roles as tumor suppressor gene.
Table 3

Cluster-specific markers after prediction analysis-PAM.

GenePAM score for cluster 2PAM score for cluster 1Fold change cluster 2 vs. 1Fold change cluster 1 vs. 2
1 ZNF331 -0,19010,15270,195,20
2 HDHD2 0,1616-0,12984,450,22
3 UGDH 0,0869-0,06983,180,31
4 TNFRSF17 0,0850-0,06823,930,25
5 FCRLA 0,0762-0,06124,510,22
6 C11ORF73 0,0709-0,05692,720,37
7 ZNF559 0,0496-0,03983,010,33
8 ARID5A -0,04950,03980,362,81
9 TCL1A 0,0414-0,03324,150,24
10 SERPINI1 0,0413-0,03323,510,28
11 RP11-35G9.3 0,0404-0,03252,840,35
12 MSH2 0,0384-0,03092,860,35
13 ACADM 0,0379-0,03053,010,33
14 FIG4 0,0363-0,02912,900,34
15 C15ORF48 -0,02940,02360,195,36
16 HIBCH 0,0228-0,01833,030,33
17 SLC7A5 -0,02240,01800,283,60
18 C17ORF62 0,0216-0,01732,870,35
19 RNASEH2A 0,0173-0,01392,410,42
20 ELL2 -0,01660,01340,382,60
21 ATG4C 0,0141-0,01143,020,33
22 SAMD9L 0,0128-0,01032,910,34
23 GOLPH3L 0,0104-0,00842,620,38
24 STX7 0,0094-0,00762,590,39
25 MTMR6 -0,00890,00720,392,57
26 HDDC3 0,0083-0,00672,350,43
27 ZDHHC16 0,0077-0,00622,580,39
28 TBCK 0,0062-0,00502,730,37
29 CYB561A3 0,0060-0,00492,450,41
30 CHD1 -0,00510,00410,442,28
31 AIM2 0,0044-0,00353,150,32
32 ACOT13 0,0023-0,00182,450,41
33 ACYP1 0,0011-0,00092,760,36
34 FCGR2B 0,0011-0,00082,880,35

Fold change from SAM

Fold change from SAM We conclude that the similarity in different cohorts with regard to differential expression patterns reflects the robustness in the group structure (i.e., the presence of two subtypes of patients), and we suggest that important genes such as TCL1A and ZNF331 are accountable for the biological subdivision. When analyzing the total list of genes that were differentially expressed between clusters through functional enrichment, many co-occurring annotations were found. The top annotations or terms, in order of corrected P values, were amino acid degradation (2.87361e -14), purine and pyrimidine metabolism (3.32583e-13 and 1.00552e-11, respectively), B cell receptor signaling pathway (6.55445e-11), protein processing in endoplasmic reticulum (8.21315e-11), RNA degradation (2.76037e-10), and RNA transport (6.45845e-10). MAPK signaling also had a significant P value (1.52119e-06). Given the importance of signaling pathways in cancer, we enlisted the genes identified in the analysis that were involved in the BCR and MAPK signaling pathways. Genes involved in the BCR signaling pathway included MAPK1, CR2, CD19, BTK, PIK3R5, SYK, NFKB1, VAV1, AKT1, CD79B, NFATC1, PPP3CB, PIK3CA, BLNK, FCGR2B, MAP2K2, PIK3R2, IKBKB, PIK3AP1, RELA, RAF1, NRAS, SOS1, NFKBIB, NFATC2, PIK3R1, RAC2, PTPN6, PPP3CA, PRKCB, and NFATC3. Genes involved in MAPK signaling included MAPK1, BRAF, MAPK9, PAK2, NFKB1, AKT1, PPP3CB, MAP2K2, IKBKB, RAF1, SOS1, NFATC2, NFKB2, CDC42, and PPP3CA, among others. When the number of differentially expressed genes was reduced to include only those with the largest differences in expression, the above annotations were maintained with significant P vales, and the BCR and MAPK pathways are highlighted (Fig 4).
Fig 4

Tag cloud representing the most significant terms enriched in the list of DE genes.

The sizes of the tags vary according to the number of supporting genes found in each enriched term.

Tag cloud representing the most significant terms enriched in the list of DE genes.

The sizes of the tags vary according to the number of supporting genes found in each enriched term. It is possible that differences in the clusters are due to the B-cell receptor (BCR) activation, which can trigger the activation of downstream signaling pathways such as the MAPK pathway. This response can vary depending on the cellular microenvironment.

Microenvironment Signature Activation

Because the tumor microenvironment may contribute to CLL pathogenesis, we searched for possible microenvironment associations in the data. To associate the samples with a CLL microenvironment signature, we used the NTP algorithm. We used the raw data from Herishanu et al [30] to obtain a microenvironment signature and found 86 differentially expressed genes between the LN and PB (>2-fold change, FDR <20%) (). Many of the genes that were overexpressed in the LN are considered BCR target genes. Functional analysis of this microenvironment signature using Genecodis software identified a set of BCR-related genes as the most overrepresented; the NF-κB and NFAT pathways were also represented, both of which are activated by BCR signaling. Therefore, the microenvironment signature obtained here indicates the activation of distinct signaling pathways and tumor proliferation in the LN, as reported previously by Herishanu et al [32]. After NTP, 88.6% of the samples were assigned to one of two possible signature classes using FDR <0.05, and 93.4% of the samples were assigned to one of two possible signature classes using FDR <0.2. The CLL samples were enriched for the microenvironment-related signature, even though a relationship with specific clusters was not clearly found (Fig 5). It was possible to find agreement between the prediction made by the method (signature class assigned) and the two clusters in up to 56% of the cases. We also used the BCR stimulation signature previously described by Pede V et al [33] and observed a less confident prediction (80.3% of samples using FDR <0.2 and 57.6% of samples using FDR <0.5) and a lack of clear association with the clusters. The tested bone marrow signature also failed to show a clear relationship with a specific cluster.
Fig 5

Heatmaps showing the association of clusters with A. microenvironment (ME) and B. BCR stimulation signatures. Both clusters showed a degree of ME and BCR signaling activation. Nearest template prediction (NTP) using C. microenvironment and D. BCR stimulation signature.

The division of molecularly heterogeneous samples into two clusters can be influenced by multiple and complex processes, including the influence of the cell microenvironment. Additionally, signatures applied in the prediction method are very particular and specific. Therefore, it was not possible to link all the samples to the microenvironment signature tested.

Clustering and Survival Analysis

To evaluate the clinical relevance of the clustering, we assessed cluster membership in relation to overall survival and time to treatment using the GSE22762 and GSE39671 datasets, respectively. Kaplan-Meier curves showed that the cluster 2 patients had poorer outcomes compared to the patients of cluster 1 (Fig 6). We compared the two groups using the log-rank test to evaluate the prognostic value of the model, and this analysis revealed a highly significant difference between expression levels and TTT and a nearly significant difference in OS.
Fig 6

Clinical outcomes of the patients in the two cohorts.

A. GSE39671, time to treatment (TTT) according to the cluster 1 and cluster 2. B. GSE22762, overall survival (OS) according to the cluster 1 and cluster 2.

Clinical outcomes of the patients in the two cohorts.

A. GSE39671, time to treatment (TTT) according to the cluster 1 and cluster 2. B. GSE22762, overall survival (OS) according to the cluster 1 and cluster 2. To evaluate the contribution of individual genes to the prognostic difference between clusters, we applied Cox regressions to 230 genes (genes with the highest expression differences between clusters, P value = 0 and median fold change >2). The analysis confirmed the results of the Kaplan-Meier curves: the two clusters showed prognostic differences, and almost all of the up-regulated genes in each cluster have the same relationship with survival (i.e., negative for cluster 2 and positive for cluster 1) (). Genes with statistically significant differences for both survival indicators (TTT and OS) can be considered highly informative of survival and are listed in Table 4. NRIP1 and MAFB from cluster 1 are highlighted due to their lower P values and positive relationship.
Table 4

Genes showing common statistically significant differences for TTT and OS.

Cluster 2Cluster 1
Genesurvival outcomeTTTOSGensurvival outcomeTTTOS
1 FCRLA neg4,79E-036,74E-041 SERPINB2 pos5,90E-035,28E-04
5 SERPINI1 neg2,41E-023,75E-022 DENND4B pos5,84E-032,39E-04
10 CPNE5 neg1,53E-024,30E-033 C15ORF48 pos4,87E-021,02E-02
11 HIBCH neg4,84E-031,29E-026 G0S2 pos3,33E-035,01E-03
18 FCGR2B neg1,20E-031,02E-039 MAFB pos3,40E-034,49E-07
21 RP1135G93 neg3,25E-029,26E-0310 IL1B pos1,23E-021,16E-03
26 NAPSB neg1,69E-021,27E-0313 NINJ1 pos1,53E-035,10E-03
30 CD27 neg2,64E-034,53E-0414 NRIP1 pos1,23E-021,89E-10
35 DYNLL1 neg4,54E-031,09E-0215 PFKFB3 pos1,20E-021,96E-03
36 ZDHHC16 neg2,49E-023,23E-0218 IER3 pos3,90E-032,03E-04
37 SPIB neg1,21E-028,54E-0319 SGK1 pos2,91E-026,59E-05
38 TMEM14C neg4,84E-033,69E-0220 THBS1 pos2,04E-032,14E-04
43 TMEM251 neg4,63E-034,33E-0223 IL8 pos3,58E-031,94E-03
47 ACOT13 neg1,53E-021,30E-0225 C5AR1 pos3,36E-032,75E-03
49 RAC2 neg6,56E-039,14E-0329 GNA15 pos7,37E-031,06E-03
51 RNASEH2A neg4,50E-036,34E-0334 FOSL2 pos3,73E-024,59E-03
58 HDDC3 neg1,35E-023,34E-0235 TREM1 pos1,59E-023,13E-02
59 KIAA1407 neg9,41E-033,59E-0340 THBD pos8,48E-036,87E-03
62 AIDA neg6,66E-044,71E-0241 UPP1 pos8,78E-031,12E-04
64 VPREB3 neg2,16E-023,08E-0343 CCR1 pos1,53E-027,36E-04
78 FCRLB neg2,71E-033,52E-0453 PLAUR pos4,95E-033,57E-03
79 DAD1 neg4,63E-022,19E-0256 WHAMM pos3,51E-027,29E-03
86 RUVBL1 neg1,82E-038,55E-0375 SMIM3 pos1,11E-022,38E-04
93 ECHS1 neg1,44E-029,43E-03
111 CDK2AP2 neg7,40E-033,81E-02
116 MPV17 neg2,42E-023,57E-04
120 RP5886K23 neg1,96E-022,20E-04
121 CISD1 neg2,44E-035,61E-03
126 ECI1 neg1,19E-028,14E-05
138 AP2B1 neg2,07E-021,47E-02
142 PRDX1 neg7,02E-031,16E-02
144 SWI5 neg3,35E-023,00E-03
147 BTK neg3,92E-027,97E-03     

positive (pos), negative (neg). Genes listed by fold change.

Of the 230 genes analyzed, it was found that several genes were associated with TTT and OS (111 and 101, respectively). A total of 83 genes had a negative relationship with TTT, and 28 genes had a positive relationship with TTT, whereas 60 genes had a negative relationship with OS and 41 had a positive relationship with OS (). positive (pos), negative (neg). Genes listed by fold change. Analyzing the clusters and the relationship to IGVH mutational status in the 4 independent cohorts, we found that the segregation of the mutational status was independent of the cluster membership; this was confirmed in all 4 independent cohorts, as seen in the heatmaps of Fig 7. Furthermore, when examining known genes related to IGVH mutational status (genes previously reported in the literature as expressed with a particular pattern in mutated vs. non-mutated IGVH, e.g., LPL, ZAP70, CRY1, and ZBTB20), it was found that these markers were not differentially expressed in the clusters.
Fig 7

Heatmaps depicting the differential gene expression pattern in clusters cluster 1 and cluster 2 as well as the IGVH status.

A, B, C, and D: the GSE38611, GSE2466, GSE9992 and GSE46261 cohorts, respectively. Cluster membership was independent of IGVH mutational status.

Heatmaps depicting the differential gene expression pattern in clusters cluster 1 and cluster 2 as well as the IGVH status.

A, B, C, and D: the GSE38611, GSE2466, GSE9992 and GSE46261 cohorts, respectively. Cluster membership was independent of IGVH mutational status. In conclusion, the survival analysis of the two previously recognized clusters revealed a survival difference that may be attributable to gene expression. Several genes emerged as prognostic markers of survival. The gene expression differences between clusters observed here could provide new information about CLL prognosis that is independent of the IGVH mutational status.

Discussion

In this paper, using a robust methodology and several cohorts of CLL patients reflecting a broad spectrum of molecular events in the disease, it was possible to distinguish two different patient subgroups and identify subgroup-specific genes. The similarity in the different cohorts, with regard to differential expression patterns between the two identified subgroups, reflects the robustness of the structure. The subdivisions were related with differential clinical outcomes and genes associated with microenvironment and the MAPK and BCR signaling pathways. The TCL1A gene is important in the distinction between clusters due to its up-regulated expression in one of the clusters, reproducibility between cohorts, and its role in the CLL microenvironment and CLL pathogenesis. A high expression level of this proto-oncogene has been associated with causal events in the development of CLL [35, 36, 37]. Sivina et al [38] showed that TCL1A was among the top genes up-regulated in CLL cells by bone marrow stromal cells (BMSCs). These authors provided evidence that the stromal microenvironment induces TCL1A overexpression in CLL cells and represses TCL1A target molecules (AP-1 proteins of the FOS/JUN family). Particularly in lymphoid cells, AP-1 proteins can exhibit induction of apoptosis and tumor-suppressive roles [40, 41, 42]. Therefore, these results suggest that TCL1A inhibits AP-1-regulated pro-apoptotic activities that normally control B cells. Interestingly, TCL1A and antigen receptors mediated signaling have been previously associated [43, 44, 45]. TCL1A seems to acts as a modulator of B-cell receptor–kinase activity, regulating the strength of BCR-mediated cellular activation. The subsequent cellular outcome, associated with apoptosis, growth, inertia, seems primarily determined by a TCL1A-dependent (AKT) [44]. The importance of TCL1A as a modulator of microenvironment-derived stimuli, suggest its pharmacologic intervention as a treatment rationale for CLL. Therapeutic approaches to disrupt BMSC interactions in CLL are being developed [46, 47], and the present study supports the division of patients based on expression of this gene prior to administration of therapy. These findings suggest that the microenvironment had a specific influence in patients from cluster 2, this result may be related to the inhibitory activity of critical pro-apoptotic factors that favor cellular survival. Although TCL1A showed no statistically significant differences when examined individually (OS: 0,0599; TTT: 0,0626), it is possible that the influence of this gene on patient survival is indirect and is related to its target genes. The TNFRSF17 gene also support the influence of stromal cells on cluster 2, this gene was the first up-regulated gene identified in experiments in which CLL cells were co-cultured with different stromal cells [38], This gene was identified as one of the most significantly differentially expressed genes between clusters in our merged data. On the other hand, the ZNF331 gene is of particular interest for cluster 1 due to its high score in the prediction analysis, this gene is a Kruppel-associated-box zinc-finger protein gene with a role in TP53 reactivation and induction of tumor cell apoptosis. Nahi et al [48] found evidence of dose-dependent apoptosis and cytotoxicity in CLL cells and suggested that ZNF331 is a small molecule that targets TP53, which could be useful for the treatment of drug-resistant leukemia. In addition, some evidence suggests that ZNF331 expression in CLL is associated with a higher risk of relapse after treatment, suggesting its use a potential marker for risk [49]. Yu et al [50] recently reported that ZNF331 is a candidate tumor suppressor gene primarily involved in gastric carcinogenesis, and Vedeld et al [51] found evidence that this gene is methylated in gastrointestinal cancers. Given the role of ZNF331 as a putative tumor suppressor and the findings demonstrating the important tumor-suppressing functions of zinc-finger proteins and their promising application in cancer therapy, it is worth exploring the functional role of this gene in CLL. Based on the modular enrichment analysis and the examination of differentially expressed genes, it is possible to speculate that differences in the clusters are due to B-cell receptor (BCR) activation and downstream signaling. The MAPK signaling pathway is one the pathways activated by the BCR receptor [52]. Antigen-dependent BCR activation has been shown to accelerate disease progression in a mouse lymphoma model [53]. Enrichment of the MAPK signaling pathway in CLL is consistent with recent work by Chuang et al [6]; these authors identified gene co-expression subnetworks that were associated with disease progression. In one of these subnetworks, genes in the MAPK signaling pathway had higher expression levels in patients at early stages of the disease. The groups obtained here are supported by a robust methodology. Different clustering methods have been developed and used to search for structure in gene expression data and extract meaningful biological information. However, each method has limitations, and there is no consensus regarding the best method of clustering. Therefore, we applied different unsupervised methodologies to confirm the structure of the two groups. We applied NMF consensus clustering and hierarchical clustering, and for most of the samples, the class membership results were congruent. NMF clustering appears to have some advantages over other methods, as it is not based on distances and provides a quantitative measure with which to identify the number of clusters. Thus, we used this algorithm for our further analysis to identify cluster markers. NMF clustering has been successfully used in other cancer studies. For example, Collisson et al [54] identified subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy, and Sadanandam et al [55] proposed a colorectal cancer classification scheme associated with phenotype and responses to therapy. Without a doubt, unsupervised class discovery in cancer research has led to the identification of subgroups with prognostic implications and generated multiple biomarkers of major importance. However, unsupervised clustering in CLL has been poorly explored, most studies of CLL have been focused on the analysis of known prognostic markers such as IGVH status, cytogenetic aberrations and mutated genes recently identified by next-generation sequencing [1–3, 5]. To our knowledge, the use of unsupervised clustering of expression data in CLL is just beginning to be explored [56]. The present work provides additional information that aids our understanding of this disease, including information about a range of transcriptional markers with potential clinical implications.

Consensus matrix, samples and class membership.

(PDF) Click here for additional data file.

Genes that were differentially expressed between clusters based on SAM.

Cluster 2 vs. 1 (Table A). Cluster 1 vs. 2 (Table B). (XLSX) Click here for additional data file.

Genes with statistically significant differences for survival.

Time to treatment (Table A). Overall survival (Table B). (XLSX) Click here for additional data file.

Preprocessed, merged and adjusted data for 237 samples.

(TXT) Click here for additional data file.

Samples and class membership.

(TXT) Click here for additional data file.

Genes that were differentially expressed between LN and PB.

(TXT) Click here for additional data file.
  45 in total

1.  Molecular and transcriptional characterization of 17p loss in B-cell chronic lymphocytic leukemia.

Authors:  Sonia Fabris; Laura Mosca; Katia Todoerti; Giovanna Cutrona; Marta Lionetti; Daniela Intini; Serena Matis; Monica Colombo; Luca Agnelli; Massimo Gentile; Mauro Spriano; Vincenzo Callea; Gianluca Festini; Stefano Molica; Giorgio Lambertenghi Deliliers; Fortunato Morabito; Manlio Ferrarini; Antonino Neri
Journal:  Genes Chromosomes Cancer       Date:  2008-09       Impact factor: 5.006

2.  Pharmacologic interception in T-cell leukemia 1A associated pathways as a treatment rationale for chronic lymphocytic leukemia.

Authors:  Wagma Popal; Jorge Boucas; Abdul Ali Peer-Zada; Marco Herling
Journal:  Leuk Lymphoma       Date:  2010-08

3.  High TCL1 levels are a marker of B-cell receptor pathway responsiveness and adverse outcome in chronic lymphocytic leukemia.

Authors:  Marco Herling; Kaushali A Patel; Nicole Weit; Nils Lilienthal; Michael Hallek; Michael J Keating; Dan Jones
Journal:  Blood       Date:  2009-09-21       Impact factor: 22.113

Review 4.  CXCR4 antagonists: targeting the microenvironment in leukemia and other cancers.

Authors:  J A Burger; A Peled
Journal:  Leukemia       Date:  2008-11-06       Impact factor: 11.528

5.  TCL1 in B-cell tumors retains its normal b-cell pattern of regulation and is a marker of differentiation stage.

Authors:  Marco Herling; Kaushali A Patel; Eric D Hsi; Kong-Chao Chang; George Z Rassidakis; Richard Ford; Dan Jones
Journal:  Am J Surg Pathol       Date:  2007-07       Impact factor: 6.394

6.  FCRL2 expression predicts IGHV mutation status and clinical progression in chronic lymphocytic leukemia.

Authors:  Fu Jun Li; Shouluan Ding; Jicun Pan; Mikhail A Shakhmatov; Elena Kashentseva; Jiongru Wu; Yufeng Li; Seng-jaw Soong; Nicholas Chiorazzi; Randall S Davis
Journal:  Blood       Date:  2008-02-26       Impact factor: 22.113

7.  Tcl1 functions as a transcriptional regulator and is directly involved in the pathogenesis of CLL.

Authors:  Yuri Pekarsky; Alexey Palamarchuk; Vadim Maximov; Alexey Efanov; Natalya Nazaryan; Urmila Santanam; Laura Rassenti; Thomas Kipps; Carlo M Croce
Journal:  Proc Natl Acad Sci U S A       Date:  2008-12-08       Impact factor: 11.205

8.  GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists.

Authors:  Pedro Carmona-Saez; Monica Chagoyen; Francisco Tirado; Jose M Carazo; Alberto Pascual-Montano
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

9.  GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information.

Authors:  Ruben Nogales-Cadenas; Pedro Carmona-Saez; Miguel Vazquez; Cesar Vicente; Xiaoyuan Yang; Francisco Tirado; Jose María Carazo; Alberto Pascual-Montano
Journal:  Nucleic Acids Res       Date:  2009-05-22       Impact factor: 16.971

10.  The B cell antigen receptor and overexpression of MYC can cooperate in the genesis of B cell lymphomas.

Authors:  Yosef Refaeli; Ryan M Young; Brian C Turner; Jennifer Duda; Kenneth A Field; J Michael Bishop
Journal:  PLoS Biol       Date:  2008-06-24       Impact factor: 8.029

View more
  2 in total

1.  Unveiling differentially expressed genes upon regulation of transcription factors in sepsis.

Authors:  Junli Zhang; Yuelei Cheng; Minmin Duan; Nannan Qi; Jian Liu
Journal:  3 Biotech       Date:  2017-04-25       Impact factor: 2.406

2.  Novel genes exhibiting DNA methylation alterations in Korean patients with chronic lymphocytic leukaemia: a methyl-CpG-binding domain sequencing study.

Authors:  Miyoung Kim; Eunyup Lee; Dae Young Zang; Hyo Jung Kim; Ho Young Kim; Boram Han; Han-Sung Kim; Hee Jung Kang; Seungwoo Hwang; Young Kyung Lee
Journal:  Sci Rep       Date:  2020-01-23       Impact factor: 4.379

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.