Literature DB >> 33916856

Cancer Subtype Recognition Based on Laplacian Rank Constrained Multiview Clustering.

Shuguang Ge¹, Xuesong Wang¹, Yuhu Cheng¹, Jian Liu¹.

Abstract

Integrating multigenomic data to recognize cancer subtype is an important task in bioinformatics. In recent years, some multiview clustering algorithms have been proposed and applied to identify cancer subtype. However, these clustering algorithms ignore that each data contributes differently to the clustering results during the fusion process, and they require additional clustering steps to generate the final labels. In this paper, a new one-step method for cancer subtype recognition based on graph learning framework is designed, called Laplacian Rank Constrained Multiview Clustering (LRCMC). LRCMC first forms a graph for a single biological data to reveal the relationship between data points and uses affinity matrix to encode the graph structure. Then, it adds weights to measure the contribution of each graph and finally merges these individual graphs into a consensus graph. In addition, LRCMC constructs the adaptive neighbors to adjust the similarity of sample points, and it uses the rank constraint on the Laplacian matrix to ensure that each graph structure has the same connected components. Experiments on several benchmark datasets and The Cancer Genome Atlas (TCGA) datasets have demonstrated the effectiveness of the proposed algorithm comparing to the state-of-the-art methods.

Entities: Chemical Disease Gene Species

Keywords: Laplacian Rank Constrained; cancer subtype recognition; graph learning; multiview clustering

Year: 2021 PMID： 33916856 PMCID： PMC8065670 DOI： 10.3390/genes12040526

Source DB: PubMed Journal: Genes (Basel) ISSN： 2073-4425 Impact factor: 4.096

1. Introduction

Tumor is a malignant heterogeneous disease caused by changes in cellular components at the levels of expression, epigenetics, transcription and proteomics. The heterogeneity will be reflected in that the same cancer will produce the subtypes with different phenotypes, which will affect the clinical treatment and prognosis [1,2]. With the development and maturity of new generation sequencing technologies, large amounts of biological data are collected in public databases that are easily accessible to researchers [3]. For example, The Cancer Genome Atlas (TCGA), a landmark cancer genomics project, stores information on biological processes such as mRNA expression data, DNA methylation data, miRNA expression data and mutation data for more than 30 cancers and thousands of cancer patients [4]. Therefore, in order to solve the problem of cancer subtype recognition, building a multiview clustering model that makes full use of biological information plays a significant role. In order to implement the task of clustering, scholars initially focus on dimensionality reduction, matrix decomposition and linear regression technologies. They all use different strategies to project high-dimensional data into low-dimensional feature space, and then achieve clustering by k-means [5,6,7,8,9,10]. For example, an effective classical method, iCluster [5], builds a Gaussian latent variable model and its modified version, iClusterPlus [6], considers different variable types following different linear probabilistic relationships to build a regression model. Both of them achieve a low-dimensional space with the combination of different biological characteristics. The other method, Pattern Fusion Analysis (PFA) [10], first uses an improved Principal Component Analysis (PCA) to find out a low-dimensional matrix of each sample, and then uses an adaptive alignment algorithm to build a fused low-dimensional feature space. However, these methods may further dilute the already low signal-to-noise ratio and increase the noise pollution to the results. Considering that the sample (patient) size of the biological data is much smaller than the feature (gene) size, some graph-based learning methods for cancer subtype recognition are designed [11,12,13,14,15,16,17]. These methods use the sample points to quickly construct the similarity graph, which can be converted into the problem of spectral clustering. For example, a widely mentioned algorithm, Similarity Network Fusion (SNF) [11], constructs the global and local similar networks for each data, and then integrates them into the final similar network based on the strategy of information propagation to dilute low similarity and enhance high similarity. Inspired by SNF, Ma et al. provided Affinity Network Fusion (ANF) [12], which constructs patients’ k-nearest neighbor similar network for each data type, and then fuses these networks based on the random walk method. In addition, Yu et al. proposed Multiview Clustering using Manifold Optimization (MVCMO) [17], and solved the problem of spectral clustering optimization by using the line search method on Stiefel manifold space. However, most existing graph-based multiview clustering methods separate the data clustering process from graph learning process [18,19]. In some methods, the construction of the graph is independent of the clustering task, resulting in its performance being highly dependent on the predefined graph. Recently, some adaptive graph learning methods using a rank constraint on the Laplacian matrix have been able to directly reveal the clustering structure, which makes the graph construction closely related to the clustering task [20,21,22,23,24]. In addition, the similarity between sample points may commonly behave differently in different views in the process of graph fusion. Some existing algorithms simply take the average of the affinity graph of multiple views to represent the result of the fusion graph [25,26]. Therefore, the rich heterogeneous information is not fully utilized. To sum up, we designed a graph-based multiview clustering algorithm, called Laplacian Rank Constrained Multiview Clustering (LRCMC). Firstly, the Laplacian Rank Constraint (LRC) algorithm [27] is used to simultaneously find the affinity graph and the embedding matrix in each view to ensure that the graph structure is on the same connected components. Then, based on the method of Nie et al. [24], we use LRC method to obtain the consistent graph, whose connected components are the same as the affinity graph of each view. Finally, the clustering structure is obtained. In the process of graph fusion, the inverse distance weighting scheme is employed to design different weights for each view’s affinity graph [24], so as to adjust the structure of the consistent graph more effectively. Moreover, the processes of graph learning, graph fusion and clustering are coupled into an optimization problem to update the more accurate consistent graph and improve the results of the clustering. In order to evaluate the effectiveness of the proposed method, experiments were carried out on four benchmark datasets and four TCGA datasets. Four start-of-the-art methods were used for comparison. The values of Accuracy (ACC), Normalized Mutual Information (NMI) and Purity on benchmark datasets, which are commonly used metrics in clustering analysis, and the p value obtained from survival analysis on the TCGA dataset can all show that the proposed LRCMC approach achieves considerable improvement over the state-of-the-art baseline methods. In the analysis of the Glioblastoma Multiforme (GBM) subtypes, we found these clusters have biological significance, e.g., the Proneural subtype granted by G-CIMP phenotype has a better survival advantage. The source code and datasets can be found in the Supplementary File 1.

2. Methods

The overall flow of LRCMC is shown in Figure 1. Specifically, given a set of omics data with m views , a set of affinity graph matrices are constructed, respectively, according to . It should be emphasized that the process of learning the affinity matrix in LRCMC is different from most multiview clustering algorithms. are not calculated directly from the original matrix, but are constructed, respectively, from a set of embedding matrices by the LRC method. Therefore, each affinity graph matrix is constrained to the same connected components, which ensures that each affinity graph has a similar structure before the fusion process. Then, the proposed fusion method is applied to the affinity graph matrices of all views in order to learn a consistent graph matrix . Simultaneously, each view is automatically assigned a different weight to represent its contribution to during the fusion process. Finally, the learned consistent graph matrix is used to optimize the affinity graph matrix for each view. The LRC method is also imposed to constrain that the number of connected components in the is equal to the required number of clusters c by constructing the fusion embedded matrix . Our LRCMC improves the affinity matrix of each view, builds a fused consistent graph matrix and obtains clustering results simultaneously.

Figure 1

The flow chart of Laplacian Rank Constrained Multiview Clustering.

2.1. Construction of Affinity Graph Based on LRC

Given a single biological data denotes the v-th view data with d features, where n is the number of data points. represents the similar relationship between the sample points in the graph learning framework. The smaller the distance between a pair of vertices in the graph is, the greater the similarity between the pair of vertices will be, the greater the corresponding weight will be, and vice versa. Based on the manifold structure of graph, the most traditional way to build is by generating a k-nearest neighbor graph for it. A pair of vertices are considered connected if they are near neighbors. There are other effective strategies to design more accurate affinity graph , e.g., smooth representation [28], Gaussian kernel for similarity learning [29], etc. For the purpose of clustering, if the sample points can be assigned to the c categories, the obtained should contain exact c connected components. Based on the following Theorem 1, can be realized. Multiplicity c of zero eigenvalues of Laplacian matrixis equal to the number of connected components of its similarity matrix. When all the elements in satisfy the non-negative condition, its Laplacian matrix has the above property [30,31]. Theorem 1 means if , where is i-th smallest eigenvalue of , the data points on have been ideally assigned to c categories [32], Laplacian rank meets the constraint condition . Therefore, based on the Ky Fan’s theorem [33], we can minimize approximately meeting the requirement of Theorem 1. The objective function is written as: where is obtained by the c eigenvectors of corresponding to the c smallest eigenvalues. denotes the trace operator, is the Laplacian matrix, is a diagonal matrix and its elements are column sums of . However, the solution to in Equation (1) is actually to solve trivial solution to . Therefore, a -norm regularization term is employed to obtain smooth and each column of satisfies , where is the jth column of [21]. Finally, we can obtain the objective function related to and simultaneously: where is the regularization parameter. A set of the affinity graph matrices and the embedded matrices are obtained through Equation (2) without the participation of the original data. However, these affinity matrices are unrelated; if they are simply stacked together for clustering, the graphs will be badly damaged and the algorithm performance will degrade. Therefore, we need to introduce a graph fusion strategy to construct a consistent graph matrix with the unified connected components.

2.2. Graph Fusion with LRC

Integrating these basic graphs to form the fused affinity graph , two intuitive points should be considered: (1) The designed graph for each view can be considered as the consistent graph with noise representation and outlier interference. (2) closer to should be given greater weight to reduce the perturbation of the low-quality graphs on the fusion graph. In this way, can accurately capture the true similarity hidden in the multiview data. Therefore, we employed the proposed method of Nie et al. [24] to optimize as follows: where is the weight of the single affinity graph . The inverse distance weighting scheme is designed to calculate . The Lagrange function of Equation (3) can be written as: where is the Lagrange multiplier, is the formal term derived from constraint condition. Taking the derivative of Equation (4) w.r.t and setting the derivative to zero, we can obtain: where is given as follows: Here, a set of weights and a consistent graph matrix are obtained from Equation (3). In order to make the learned also have c connected components for clustering, the LRC term is added to Equation (3) according to Theorem 1 and Ky Fan’s Theorem. The objective function is as follows: where is obtained by the c eigenvectors of corresponding to the c smallest eigenvalues, i.e., the embedded matrix corresponding to . is the Laplacian matrix, is a diagonal matrix and its elements are column sums of . is the regularization parameter.

2.3. LRCMC Algorithm

As described in Section 2.1 and Section 2.2, the LRC operation is used to guarantee the structures of and . Therefore, we can combine Equations (2) and (7) into a final objective function, i.e., the proposed Laplacian Rank Constrained Multiview Clustering (LRCMC). It is represented as: Here, we complete the tasks of graph construction, graph fusion and clustering in one step through the integrated model. In this way, the learning of and can help each other embedded in a joint coupling problem. The objective function Equation (8) enjoys the following properties: Our method can effectively learn a set of affinity graph matrices with c connected components, instead of most multiview clustering methods requiring predefined graphs; In the graph fusion process, we assign the weight to each view to represent their contribution to the consistent graph , rather than simply superimposing then together; We use LRC to constantly adjust the structures of and , and at the same time complete the task of clustering.

2.4. Optimization Algorithm of LRCMC

Obviously, since the variables in Equation (8) are coupled to each other, we use alternating iterative method and Augmented Lagrange Multiplier (ALM) scheme to solve , , , , . The specific solution process is as follows: Fix , , and , solve ; The Equation (8) becomes: Due to , where and denote the i-th and j-th column of , respectively, denotes the (i, j)th element of , the Equation (9) can be written in vector form as: Denote , Equation (10) is obviously written as: Then, the Equation (11) can be written as: Therefore, the Lagrangian function of Equation (12) combined with its constraints can be defined as: where is the Lagrangian coefficient scalar and is the Lagrangian coefficient vector. Based on the Karush-Kuhn-Tucker (KKT) condition [34], the optimal solution of can be estimated as: The study in [35] found that sparse representation is robust to noise and outliers. In order to obtain the sparse affinity graph , we can find the k nonzero adaptive neighbors for to satisfy and . Denote , then, we arrive at: Moreover, according to Equation (15) and the constraint condition , is given: Therefore, according to Equations (15) and (16), the range of is obtained as follows: Then, the parameter can be set as: Finally, according to Equations (15), (16) and (18), the optimal solution of in is represented as: Fix , , and , solve ; Updating in Equation (8) is converted to Equation (2). Therefore, is updated from Equation (1) in Section 2.1. Fix , , and , solve ; Updating in Equation (8) is equivalent to Equation (3). Therefore, is updated from Equation (6) in Section 2.2. Fix , , and , solve .; Updating Z in Equation (8) is converted to Equation (7). Due to , where and denote the i-th and j-th column of , denotes the (i, j)th element of , Equation (7) yields: Denote , we have: Based on the Karush–Kuhn–Tucker (KKT) condition [34], the closed form solution of can be estimated as: Equation (22) can be solved by an efficient optimization method proposed in [35]. Fix , , and , solve . According to the method of finding , is obtained as follows: The final solution of is the c eigenvectors of corresponding to the c smallest eigenvalues.

3. Experiments’ Results

In order to verify effectiveness of LRCMC in cancer subtype recognition, LRCMC was compared with four state-of-the-art clustering algorithms, i.e., ANF [12], SNF [11], PFA [10] and MVCMO [17]. Since biological omics data are not labeled, we first downloaded four widely used benchmark datasets containing real labels, i.e., 3-source, Calt-7, MSRC, WebKB, to verify that proposed LRCMC can achieve good clustering effect. Furthermore, we applied LRCMC to the datasets downloaded and preprocessed by Wang et al. [11] from TCGA. The datasets contain four types of cancer, i.e., GBM, Breast Invasive Carcinoma (BIC), Lung Squamous Cell Carcinoma (LSCC) and Colon Adenocarcinoma (COAD).

3.1. Comparison Experiments on Benchmark Datasets

The benchmark datasets are described as follows: 3-source [20]: It contains 169 news that were reported by three news magazines, i.e., BBC, Reuters, and The Guardian. There are six different thematic labels for each news; Calt-7 [36]: The object recognition dataset is drawn from the Caltech101 dataset to screen 7 widely used classes, i.e., faces, motorbikes, dollar bill, Garfield, stop sign, and Windsor chair. Each class has 1474 images. Each image is described by 6 features, i.e., GABOR, wavelet moment (WM), CENT, HOG, GIST and LBP; MSRC [37]: The scene recognition dataset contains 7 classes of aircraft, car, bicycle, cow, faces, tree, and building. Each image is described by 5 features, i.e., color moment (CMT), HOG, LBP, CENT, GIST; WebKB [20]: It collects 203 web pages in 4 classes from the University’s Computer science department. Each page has 3 features, i.e., the content of the page, the anchor text of the hyperlink, and the text description in the title. Table 1 is an overview of these datasets, where n, m, and c describe the number of samples, views, and classes for each dataset, respectively, d denotes the i-th feature of these datasets.

Table 1

Overview of four benchmark datasets.

Dataset	n	m	c	d₁	d₂	d₃	d₄	d₅	d₆
3-source	169	3	6	3560	3631	3638	-	-	-
Calt-7	1474	6	7	48	40	254	1984	512	928
MSRC	210	5	7	48	100	256	1302	512	-
WebKB	203	3	4	1703	230	230	-	-	-

Three commonly used evaluation metrics, i.e., Accuracy (ACC), Normalized Mutual Information (NMI) and Purity, are used to quantitatively measure the clustering performance of the algorithms. The metrics compare the resulting labels with the real labels provided by the dataset. The larger the value obtained, the better the clustering results. To ensure the fairness of the comparison experiments, each algorithm was run 10 times to reduce the impact of randomness. The mean and standard deviation of the obtained metrics were calculated. In addition, the neighbor k required by ANF, SNF, MVCMO and LRCMC was set within the range of , and other parameters were specified as the default values provided by the authors. Only one parameter needs to be set in our LRCMC algorithm, which is caused by the introduction of LRC. In order to achieve rapid convergence of Algorithm 1, we adopt a dynamic parameter updating method proposed by Nie et al. [23]. is set in the range of [1,30]. If the number of connected components of is greater than c, we will shrink (). On the contrary, if less than c, we will increase () until finding the right components for . Table 2 shows the final evaluation metrics obtained by these algorithms in the four datasets. It is obvious that LRCMC achieves better clustering performance in the multiview clustering task than the other methods.

Table 2

The clustering performance comparison in terms of ACC, NMI and Purity on the four real datasets.

Datasets	Methods	ACC	NMI	Purity
3-source	ANF	0.4970 (0.0000)	0.2804 (0.0000)	0.5325 (0.0000)
	SNF	0.7811 (0.0000)	0.6942 (0.0000)	0.8166 (0.0000)
	PFA	0.4562 (0.0761)	0.2247 (0.0713)	0.7160 (0.0578)
	MVCMO	0.4221 (0.0123)	0.3035 (0.0128)	0.5266 (0.0118)
	LRCMC	0.8107 (0.0000)	0.7218 (0.0000)	0.8462 (0.0000)
Calt-7	ANF	0.6696 (0.0000)	0.6203 (0.0000)	0.8684 (0.0000)
	SNF	0.6601 (0.0000)	0.5637 (0.0000)	0.8562 (0.0000)
	PFA	-	-	-
	MVCMO	0.6654 (0.0100)	0.5179 (0.0355)	0.8464 (0.0083)
	LRCMC	0.8548 (0.0000)	0.7694 (0.0000)	0.8921 (0.0000)
MSRC	ANF	0.8048 (0.0000)	0.7297 (0.0000)	0.8143 (0.0000)
	SNF	0.8429 (0.0000)	0.7514 (0.0000)	0.8429 (0.0000)
	PFA	-	-	-
	MVSCO	0.7800 (0.0544)	0.6711 (0.0628)	0.7838 (0.0462)
	LRCMC	0.8905 (0.0000)	0.7922 (0.0000)	0.8905 (0.0000)
WebKB	ANF	0.6798 (0.0000)	0.1718 (0.0000)	0.6946 (0.0000)
	SNF	0.7044 (0.0000)	0.2407 (0.0000)	0.7192 (0.0000)
	PFA	0.7143 (0.0000)	0.3191 (0.0000)	0.8128 (0.0000)
	MVCMO	0.7652 (0.0346)	0.3548 (0.0448)	0.7833 (0.0323)
	LRCMC	0.8079 (0.0000)	0.5081 (0.0000)	0.8424 (0.0000)

- means that the metrics cannot be calculated, the best results have been highlighted in bold.

3.2. Comparison Experiments on TCGA Datasets

To demonstrate the effectiveness of LRCMC in identifying cancer subtype, the designed LRCMC was applied to four cancer omics datasets, i.e., GBM, BIC, LSCC, and COAD. Each cancer subtype contains three types of expression data from different platforms, i.e., mRNA expression data, DNA methylation data and miRNA expression data. Table 3 shows the number of samples (patients) and features (genes) held by each cancer subtype.

Table 3

Overview of the TCGA datasets.

Datasets	N	mRNA Expression	DNA Methylation	miRNA Expression
GBM	213	12,042	1305	534
BIC	105	17,814	23,094	354
LSCC	106	12,042	23,074	352
COAD	92	17,814	23,088	312

To ensure that the identified cancer labels conform to the true clinical diagnosis, we specified that the number of samples in each cluster should be at least 3. We used the number of subtypes of GBM, BIC, LSCC, and COAD specified by Wang et al., which were 3, 5, 4 and 3, respectively. Then, the p values based on Cox log-rank model were used to evaluate the clustering results of these algorithms in survival analysis [38]. If the p value is smaller, the survival rate between different groups is more significant and the difference is greater, which means the cluster is considered to have different characteristics of the underlying cancer subtypes. Cancer survival curves can also represent heterogeneity between different subtypes. As shown in Table 4, LRCMC obtained the best p value in BIC, GBM, KRCCC and COAD. Other algorithms also had good results in specific datasets, but they were all lower than our algorithm. Therefore, we believe that LRCMC is significantly advantageous in the topic of cancer subtype recognition. Figure 2 shows the Kaplan–Meier survival analysis curves of the four cancers. Each curve depicts trends in the survival time of each cancer cluster and the number of samples for each cluster is also shown in the figure.

Table 4

p values of survival analysis in Cox log-rank model for different clustering methods of four cancers on The Cancer Genome Atlas (TCGA) datasets.

Methods	GBM	BIC	LSCC	COAD
ANF	5.8 × 10⁻⁴	3.6 × 10⁻⁴	8.9 × 10⁻³	9.0 × 10⁻³
SNF	5.0×10⁻⁵	6.9×10⁻⁴	7.8 × 10⁻³	1.6 × 10⁻³
PFA	1.8×10⁻⁴	3.1×10⁻⁴	1.1 × 10⁻²	2.4 × 10⁻²
MVCMO	1.4×10⁻³	3.5×10⁻⁴	9.1 × 10⁻³	8.5 × 10⁻³
LRCMC	1.3 × 10⁻⁵	3.7 × 10⁻⁵	3.8 × 10⁻³	1.2 × 10⁻³

The best results have been highlighted in bold.

Figure 2

The Kaplan–Meier survival curves of (a): Glioblastoma Multiforme (GBM), (b): Breast Invasive Carcinoma (BIC), (c): Lung Squamous Cell Carcinoma (LSCC) and (d): Colon Adenocarcinoma (COAD), respectively.

3.3. Analysis on GBM Dataset

GBM is the most malignant glioma among astrocytomas. It has been studied and analyzed at the genetic level by many scholars, and specific subtypes and treatment protocols have been proposed. For example, according to the mRNA expression data, Verhaak et al. [39] reported that GBM is divided into Mesenchymal, Classical, Neural and Proneural subtypes, and the heterogeneous subtypes were also verified in somatic mutations and copy number variations (CNVs). Another study divided GBM patients into two subtypes, i.e., G-CLMP and non-G-CLMP, based on the difference of CpG Island methylator phenotype (CLMP) [40]. Table 5 shows the distribution of the cluster results obtained by LRCMC on the subtype identified by these two studies. From Table 5, there are more patients in cluster 1 than in cluster 3, and all of them are assigned to non-G-CLMP subtype—also they have four subtypes identified based on mRNA expression. The point is that the Proneural subtype in these two clusters belong to non-G-CLMP subtype. However, cluster 2, with a smaller number of patients, is almost the Proneural subtype, and also belongs to G-CLMP subtype.

Table 5

The identified clusters are compared with mRNA-expression-based subtypes and methylation-based subtypes.

Our Cluster	mRNA-Expression-Based Subtypes				Methylation-Based Subtypes
Our Cluster	Mesenchymal	Classical	Neural	Proneural	G-CLMP	Non-G-CLMP
cluster 1	46	54	27	30	0	155
cluster 2	1	0	1	19	20	1
cluster 3	12	11	7	7	0	37

The values represent the number of patients counted.

To further analyze the identified clusters, we downloaded clinical data, somatic mutation data and CNV data for all patients from the cBio Cancer Genomis Portal database (http://www.cbioportal.org/ accessed on 15 December 2020). The age profiles of the three clusters (Figure 3), differential gene statistics of CNVs and mutations (Table 6), and Kaplan–Meier survival curves of Temozolomide (TMZ) (Figure 4) in GBM patients were obtained. Figure 3 shows that the diagnosis age of patients in cluster 2 with the best survival advantage is also lower than that of patients in cluster 1 and cluster 3. The genetic variant signatures associated with GBM in terms of mutation (IDH1) and CNVs (CDKN2A, CDKN2B, C9orf53, MTAP, EGFR) are significantly different in the three identified clusters. In particular, IDH1 mutation only occurs in cluster 2, while EGFR amplification is 0. Then, we divided the patients within the three clusters into two groups: patients treated with TMZ and those not treated with TMZ, then we compared the drug response. TMZ is a drug that is commonly used to treat GBM, but only responds well to a subset of patients. The p values of survival analysis in Cox log-rank model of the three cluster comparison experiments are 2.0 × 10−6, 0.76 and 0.01, respectively, which indicates that TMZ treatment has no effect on the patients in cluster 2. Therefore, in summary, we can infer that the subtype belonging to G-CLMP subtype and Proneural subtype might be a potentially new subtype. This also verified by the fact that that the Proneural subtype granted by the G-CIMP phenotype proposed by Canmenron et al. has unique properties [41].

Figure 3

Boxplot of diagnosis age for the identified clusters. It reflects the distribution of diagnosis age in each cluster. Black bar represents the median of each cluster.

Table 6

Distribution of genetic variant signatures for the identified clusters.

Our Cluster	CDKN2A.del.	CDKN2B.del.	C9orf53.del.	MTAP.del.	EGFR.ampl.	IDH1
cluster 1	84 (56.4%)	84 (56.4%)	80 (53.7%)	57 (38.3%)	70 (47.0%)	0 (0%)
cluster 2	6 (28.6%)	6 (28.6%)	6 (28.6%)	5 (23.8%)	0 (0%)	10 (66.7%)
cluster 3	24 (68.9%)	23 (62.2%)	24 (68.9%)	21 (56.8%)	19 (51.4%)	0 (0%)

The values indicate the number of variations, and the values in parentheses indicate the frequencies of variations after removing statistical missing. ‘ampl.’: amplification, ‘del.’: deletion.

Figure 4

The Kaplan–Meier survival curves of the identified clusters (a): cluster 1, (b): cluster 2 and (c): cluster 3) of Temozolomide (TMZ) response. “Untreated” expresses the group which did not receive TMZ treatment and “Treated” expresses the group which received TMZ treatment.

In addition, mRNA expression data and DNA methylation data were used to compare the differentially expressed genes in cluster 1 and 3 to look for the heterogeneity between them. We compared the genes in the two clusters using ANOVA (the lower the p-value, the higher the ranking). The gene differences in miRNA expression data were not significant enough (p values were all greater than 0.1) and were omitted from consideration. Figure 5 shows the heatmaps of the top 20 differentially expressed genes in the mRNA expression data and the DNA methylation data, respectively. It is obvious that cluster 1 and cluster 3 are different in gene expression level, and some of the genes on the heatmaps have been shown to be linked to GBM., e.g., PRKAA1 overexpressed in Cluster 3, also known as AMPK, induces antitumor activity in GBM cells and has become a possible tumor control target [42]. MUC1 overexpressed in cluster 1 is a pathogenic gene that induces GBM and can be used as a target for cellular immunotherapy [43].

Figure 5

Heatmaps of differentially expressed genes in (a): mRNA expression data and (b): DNA methylation data for the identified clusters.

Finally, we compared the three clusters with normal samples and screened for differentially expressed genes using ANOVA. We did Gene Ontology (GO: BP), KEGG pathway and Disease Ontology (DO) enrichment analysis using the top 100 differential genes in ToppGene Suite database (https://toppgene.cchmc.org/enrichment.jsp accessed on 20 December 2020). From Table 7, it is clear that the biological processes of cluster 1 are related to “epithelium development” and “cell adhesion”, while the biological processes of cluster 2 and 3 are mostly related to “protein targeting” and “protein localization”. Moreover, it is interesting to note that all three clusters are associated with anemia in DO enrichment analysis. It is possible that GBM patients treated with TMZ will develop aplastic anemia [44].

Table 7

GO: BP, KEGG pathway, DO enriched terms for the identified cluster.

ENRICHMENT Analysis	Cluster 1	Cluster 2	Cluster 3
GO:BP enriched terms	1. Epithelial cell differentiation2. Epithelium development3. Cell adhesion4. Biological adhesion5. cell–cell adhesion	1. Protein targeting to ER2. Establishment of protein localization to endoplasmic reticulum3. Protein localization to endoplasmic reticulum4. Peptide metabolic process5. Protein targeting protein targeting	1. SRP-dependent cotranslational protein targeting to membrane2. Cotranslational protein targeting to membrane3. Nuclear-transcribed mRNA catabolic process, nonsense-mediated decay4. Protein targeting to ER5. Establishment of protein localization to endoplasmic reticulum
KEGG enriched pathway terms	1. Cell adhesion molecules (CAMs)2. Tight junction3. Pathogenic Escherichia coli infection4. Leukocyte transendothelial migration	1. Ribosome2. Protein export	1. Ribosome
DO enriched terms	1. alphaThalassemia2. Dysfibrinogenemia, congenital3. Afibrinogenemia, congenital4. Heinz body anemia	1. Diamond–Blackfan anemia	1. Diamond–Blackfan anemia

We put the GO: BP terms with ranking in the top 5, and KEGG pathway and DO terms with p-value less than 1.00E-4 in the table.

4. Discussion and Conclusions

Over the past decade, it has been widely recognized that the integration and mining of different types of biological data provides meaningful insights into the causes and complexity systems of cancer [45]. Now, the challenge is still how to capture the underlying structure of the sample/features from the omics data for application to a wide range of bioinformatics topics, e.g., the prediction of drug-target relationships [46], the recognition of cancer driver genes [47], finding out about genotype–epigenetic interactions [48], etc. In this paper, our proposed LRCMC algorithm has the ability to fuse multigenomic data into the consensus graph of the exact connected components. In fact, the sample size of a given cancer is generally relatively small, so the graph learning method can quickly map the feature space into the structure of the affinity graph without the need for feature prescreening. Based on the framework of graph learning, LRCMC uses the operation of LRC to mine the structure of clustering while maintaining the graph structure. At the same time, the graph obtained by the adaptive neighbors method is sparse, so that the weak similarity relationship is even more sparse at 0, which ensures more accurate clustering results. Compared with other start-of-the-art integration clustering algorithms for cancer subtype recognition, LRCMC has the following two characteristics: (1) instead of simply treating each view equally, a wealth of heterogeneous information is taken into account to provide appropriate weight for each view; (2) the tasks of constructing the affinity matrix of each view, learning the fused matrix and clustering are completed simultaneously in a system. Furthermore, LRCMC has the following two advantages in algorithm running: (1) there is no need to spend a lot of time choosing the appropriate parameters; (2) the final consensus graph has been assigned to the given categories without adding additional base clustering algorithms. We demonstrated the power of LRCMC using four benchmark datasets and four cancer datasets. The experiments show that LRCMC has a good clustering evaluation. The cancer subtype recognition results on GBM data show that LRCMC can effectively capture cancer subtypes with specific biological characteristics based on omics data. In addition, we must admit that LRCMC also has shortcomings and limitations. It is not suitable for binary data (somatic mutation) or categorical data (copy number states: loss/normal/gain), and has only limited application to continuous data (mRNA expression) to identify cancer subtype. It also does not have the ability to find the gene modules that affect differences in each subtype. Therefore, we will continue our efforts to improve and extend the LRCMC algorithm to explore cancer heterogeneity.

28 in total

1. On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I.

Authors: K Fan
Journal: Proc Natl Acad Sci U S A Date: 1949-11 Impact factor: 11.205

2. Pattern discovery and cancer gene identification in integrated cancer genomic data.

Authors: Qianxing Mo; Sijian Wang; Venkatraman E Seshan; Adam B Olshen; Nikolaus Schultz; Chris Sander; R Scott Powers; Marc Ladanyi; Ronglai Shen
Journal: Proc Natl Acad Sci U S A Date: 2013-02-21 Impact factor: 11.205

3. LRSSL: predict and interpret drug-disease associations based on data integration using sparse subspace learning.

Authors: Xujun Liang; Pengfei Zhang; Lu Yan; Ying Fu; Fang Peng; Lingzhi Qu; Meiying Shao; Yongheng Chen; Zhuchu Chen
Journal: Bioinformatics Date: 2017-04-15 Impact factor: 6.937

Review 4. Chimeric antigen receptor T-cell therapy for glioblastoma.

Authors: Analiz Rodriguez; Christine Brown; Behnam Badie
Journal: Transl Res Date: 2017-07-15 Impact factor: 7.012

5. Similarity network fusion for aggregating data types on a genomic scale.

Authors: Bo Wang; Aziz M Mezlini; Feyyaz Demir; Marc Fiume; Zhuowen Tu; Michael Brudno; Benjamin Haibe-Kains; Anna Goldenberg
Journal: Nat Methods Date: 2014-01-26 Impact factor: 28.547

6. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis.

Authors: Ronglai Shen; Adam B Olshen; Marc Ladanyi
Journal: Bioinformatics Date: 2009-09-16 Impact factor: 6.937