
Multi-omics data integration for subtype identification of Chinese lower-grade gliomas: A joint similarity network fusion approach.

Lingmei Li¹, Yifang Wei¹, Guojing Shi¹, Haitao Yang², Zhi Li³, Ruiling Fang¹, Hongyan Cao¹,⁴, Yuehua Cui⁵.

Abstract

Lower-grade gliomas (LGG), characterized by heterogeneity and invasiveness, originate from the central nervous system. Although studies focusing on molecular subtyping and molecular characteristics have provided novel insights into improving the diagnosis and therapy of LGG, there is an urgent need to identify new molecular subtypes and biomarkers that hold promise for improving patient survival outcomes. Here, we proposed a joint similarity network fusion (Joint-SNF) method to integrate different omics data types to construct a fused network using the Joint and Individual Variation Explained (JIVE) technique under the SNF framework. Focusing on the joint network structure, a spectral clustering method was employed to obtain subtypes of patients. Simulation studies show that the proposed Joint-SNF method outperforms the original SNF approach under various simulation scenarios. We further applied the method to a Chinese LGG data set including mRNA expression, DNA methylation and microRNA (miRNA) expression. Three molecular subtypes were identified and showed statistically significant differences in patient survival outcomes. The five-year mortality rates of the three subtypes are 80.8%, 32.1%, and 34.4%, respectively. After adjusting for clinically relevant covariates, the risk of death for patients in Cluster 1 was 5.06 times that of patients in Cluster 2. The fused network attained by the proposed Joint-SNF method enhances strong similarities and thus greatly improves subtyping performance compared to the original SNF method. The findings in the real application may provide important clues for improving patient survival outcomes and for precision treatment of Chinese LGG patients. An R package implementing the method is available on GitHub at https://github.com/Sameerer/Joint-SNF.


Keywords: Joint-SNF; LGG; Multi-omics data integration; Subtypes identification

Year: 2022 PMID: 35860412 PMCID: PMC9284445 DOI: 10.1016/j.csbj.2022.06.065

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

Lower-grade gliomas (LGG), including diffuse low-grade and intermediate-grade gliomas (World Health Organization grades II and III), are the most common infiltrative neoplasms that occur in adult cerebral hemispheres [1]. Most patients exhibit a high risk of postoperative recurrence [2], and their tumors may further progress to glioblastoma (GBM, grade IV). Historically, histologic classifications and tumor grades of LGG have been used to guide therapeutic interventions. However, patients with the same grade often have distinct molecular characteristics and prognoses [3]. With the rapid development of molecular biology research on LGG, the identification of molecular subtypes and biomarkers has been explored to guide clinical decision-making [4], [5]. The identification of a group of genetic lesions, including isocitrate dehydrogenase 1/2 (IDH1/2) mutation and codeletion of chromosome arms 1p and 19q (1p/19q) [6], [7], has been a major advance in recent years. Based on these two genetic alterations, cumulative evidence shows that LGG can be classified into three subtypes with different clinical outcomes [8]. Patients with IDH mutation (IDH-mut) have longer survival than those with IDH wild-type (IDH-WT) [9]. Nonetheless, the current biomarkers still cannot adequately predict overall survival for all LGG patients. For example, IDH-WT may occur in WHO grade II gliomas or in recurrent gliomas [10]. Moreover, due to substantial heterogeneity among LGG, devising optimal therapeutic strategies at the individual level remains a great challenge [11]. Thus, there is a pressing need to develop reliable approaches to identify patients at high risk of deterioration and to find new molecular targets for developing effective treatment strategies. Recent technological advances allow us to understand the onset and progression of tumors and to identify the risk factors, molecular basis and prognostic biomarkers underlying invasive tumors [12], [13].
As in other subtyping studies, multi-omics data integration remains the preferred approach for obtaining accurate subtypes of LGG patients. Multi-omics data integration enables the joint analysis of multiple data types to provide a comprehensive understanding of the biological system and offers insights into the crucial associations between different omics data types [14]. It is a well-established strategy for identifying molecular subtypes and elucidating pathogenesis in cancer [15]. However, multi-omics integration faces several major challenges, such as the curse of dimensionality and the modeling of interactions between the different types of omics data [16], [17]. Methods that address these challenges can be broadly classified as multivariate, concatenation-based and transformation-based methods [18]. Multivariate methods, such as partial least squares or canonical correlation analysis, treat different omics data individually to discover associations between them. Concatenation-based integration combines omics data into a single matrix, which is then used as input for low-rank approximation or latent factor analysis in a low-dimensional space. Focusing on the shared information and integrative dimension reduction of multi-omics data, Lock et al. [19] proposed the Joint and Individual Variation Explained (JIVE) method, a typical example of the concatenation-based methods. It decomposes the combined omics data matrix into three terms: a low-rank joint variation matrix shared between data sets, a low-rank individual matrix specific to each data set, and the residual noise. The method can thus separate synergistic activities common to all data types from individual ones specific to a particular data type. It was applied to gene expression and miRNA data of GBM tumor samples from the TCGA database, giving a better characterization of tumor types and a better understanding of the biological interactions between different data types.
The transformation-based approaches integrate omics data after transforming each omics data type into an intermediate and common form (e.g., a graph or kernel matrix). They have the advantage of capturing individual omics characteristics in the transformation step and are robust to different data measurement scales [20]. One popular method is the Similarity Network Fusion (SNF) algorithm [21], which creates an individual sample similarity matrix for each data type and then fuses these into a single similarity network using message passing theory, making the combined networks more coherent with each iteration. As with other transformation-based methods, no feature selection step is required [22]. However, due to the limitations of measurement technology and inherent natural variation, unavoidable noise features can dilute clustering signals, leading to potential spurious associations between samples [23]. The two methods are thus largely complementary, and each has its own merits in certain aspects. To fully utilize the strengths of the two methods and achieve better subtyping performance, in this work we proposed a Joint-SNF method, which employs SNF to obtain the fused sample similarity matrix by integrating the joint structure extracted by the JIVE method. The fused network matrix enhances strong similarities and weakens spurious associations between samples while reducing the noise. We performed simulation studies to compare the performance of the proposed Joint-SNF method with that of the original SNF method.
We further applied the Joint-SNF method to integrate mRNA expression, DNA methylation and miRNA expression data obtained from the CGGA database, aiming to discover molecular subtypes of Chinese LGG patients with different prognoses. For the identified subtype with the worst prognosis, in-depth bioinformatics analysis was conducted to uncover key pathways and biomarkers that could explain the underlying molecular mechanism. Our method adds a promising new strategy to the toolbox of cancer subtyping with multi-omics data integration.

Materials and methods

Study cohort

The data in this study included clinical data and three types of omics data, downloaded from the Chinese Glioma Genome Atlas database [24] (CGGA, https://www.cgga.org.cn). CGGA, an open-access database, was launched by a team from Beijing Tiantan Hospital, Capital Medical University in 2012 and opened in 2019. LGG patients (WHO grade II and III) with survival time (from initial diagnosis to death, or to the last follow-up) and survival status were included for further analysis. All three types of omics data, namely gene-level mRNA expression data (mRNA-array_301), gene-level DNA methylation data (methyl_159) and gene-level miRNA expression data (microRNA_198), were available for 86 LGG patients.

Statistical method

Joint similarity network fusion (Joint-SNF)

The Joint-SNF method uses SNF to integrate the joint structure extracted by the JIVE method to obtain the fusion matrix. The fused network enhances strong similarities and weakens spurious associations between samples while reducing the noise. The realization of the Joint-SNF method relies on the following two algorithms.

Suppose there are $k$ data types, each measured on $p_i$ features over $n$ samples and represented by a data matrix $X_i$ with dimension $p_i \times n$ ($i = 1, \ldots, k$). The $k$ data matrices are merged to form a single data matrix $X$ with dimension $p \times n$, i.e., $X = [X_1^T, \ldots, X_k^T]^T$ with $p = p_1 + \cdots + p_k$. To eliminate baseline differences caused by different data dimensions and scales, each data type is centered by row-wise subtraction of its means, then scaled by its Frobenius norm, i.e., $X_i^{\text{scaled}} = X_i / \|X_i\|_F$.

(1) Extraction of joint structures by JIVE. JIVE is a method of integrating multiple datasets via a general decomposition of variation. The decomposition is composed of three parts: a low-rank approximation capturing the joint variation of different data types, low-rank approximations reflecting individual variation of each data type, and the residual noise [19]. Each appropriately scaled matrix $X_i$ can be decomposed into three terms: the joint structure matrix $J_i$ associated with $X_i$, the individual structure matrix $A_i$ of $X_i$, and the error matrix $\varepsilon_i$. This gives the factorized model

$$X_i = J_i + A_i + \varepsilon_i, \quad i = 1, \ldots, k.$$

Writing $J = [J_1^T, \ldots, J_k^T]^T$, the model assumes that $J A_i^T = 0_{p \times p_i}$ for $i = 1, \ldots, k$, that is, the joint and individual terms are uncorrelated. Low-rank constraints are imposed on both the joint and individual structure matrices (i.e., rank($J$) = $r$ < rank($X$), rank($A_i$) = $r_i$ < rank($X_i$)). The rank $r$ of the joint structure associated with $X_i$ is assumed to be the same for different data types. The ranks $r$ and $r_i$ are estimated via a permutation testing approach. Then, the matrices $J$ and $A_i$ can be obtained using singular value decomposition (SVD). The specific procedure is summarized in Algorithm 1.

Since the joint structure matrix $J_i$ associated with $X_i$ captures the common structures shared between different data types, it contains common information that can potentially enhance the subtyping performance. Thus, the joint structure matrices are used as the input matrices for the SNF method for subsequent subtyping.

(2) Similarity Network Fusion (SNF). SNF is a similarity-based method to integrate multi-omics data by constructing and fusing sample-sample similarity networks of patients [21]. Suppose we have $n$ samples and $r$ joint structures. A patient similarity network is denoted as a graph $G = (V, E)$. The nodes $V$ are the patients $\{x_1, \ldots, x_n\}$ and the weighted edges $E$ form an $n \times n$ similarity matrix $W$, with $W(i, j)$ measuring the similarity between patients $x_i$ and $x_j$. $W(i, j)$ is computed by a scaled exponential similarity kernel as follows,

$$W(i, j) = \exp\left(-\frac{\rho^2(x_i, x_j)}{\mu \, \varepsilon_{i,j}}\right),$$

where $\rho(x_i, x_j)$ denotes the Euclidean distance between patients $x_i$ and $x_j$, $\mu$ is a hyperparameter that can be empirically set, $\varepsilon_{i,j} = [\text{mean}(\rho(x_i, N_i)) + \text{mean}(\rho(x_j, N_j)) + \rho(x_i, x_j)] / 3$ is used for removing the scaling problem, and $\text{mean}(\rho(x_i, N_i))$ is the average value of the distances between $x_i$ and each of its neighbors $N_i$.

After constructing sample-sample similarity matrices from multiple data sources, we then fuse these similarity matrices into one similarity network. The fusion procedure is summarized in Algorithm 2. Given the matrix $W$, we obtain a normalized kernel matrix $P$, which carries the full information about the similarity of each patient to all others, and a local kernel matrix $S$, which encodes the similarity of each patient to its nearest neighbors. In this way, $P^{(v)}$ and $S^{(v)}$ for the $v$th joint structure ($v = 1, \ldots, k$) can be obtained. A message-passing process is then used to iteratively update the similarity networks to realize the fusion of networks:

$$P^{(v)}_{t+1} = S^{(v)} \times \left(\frac{\sum_{u \neq v} P^{(u)}_{t}}{k - 1}\right) \times \left(S^{(v)}\right)^T.$$

This fusion process converges to a single similarity network that summarizes the similarity between samples across all omics data types sharing the common structures. $P^{(c)}$, the average of the $P^{(v)}$ after the final iteration, represents the final fused network.
The network obtained by the Joint-SNF method is used for further spectral clustering analysis which can capture the global structure of a graph [25].
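The SNF step described above can be illustrated with a small, dependency-free Python sketch. It is a simplified illustration under stated assumptions, not the authors' R implementation: the function names (`affinity`, `full_kernel`, `local_kernel`, `fuse`) and the default values `K=3`, `mu=0.5` and `iters=20` are hypothetical, neighbours exclude the sample itself, the per-iteration re-normalization used by some SNF implementations is omitted, and the result is symmetrized only once at the end. Each "view" stands in for one joint-structure matrix, given as a list of per-sample vectors, and at least two views are required.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def affinity(samples, K=3, mu=0.5):
    """Scaled exponential similarity kernel W for a list of sample vectors."""
    n = len(samples)
    D = [[euclid(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    # Mean distance from each sample to its K nearest neighbours (self excluded).
    nn_mean = [sum(sorted(D[i][j] for j in range(n) if j != i)[:K]) / K
               for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (nn_mean[i] + nn_mean[j] + D[i][j]) / 3.0
            W[i][j] = math.exp(-D[i][j] ** 2 / (mu * eps))
    return W

def full_kernel(W):
    """Normalized kernel P: 1/2 on the diagonal, off-diagonal row mass 1/2."""
    n = len(W)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off = sum(W[i][j] for j in range(n) if j != i)
        for j in range(n):
            P[i][j] = 0.5 if i == j else W[i][j] / (2.0 * off)
    return P

def local_kernel(W, K=3):
    """Local kernel S: keep the K most similar neighbours, row-normalized."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i), key=lambda j: -W[i][j])[:K]
        total = sum(W[i][j] for j in nbrs)
        for j in nbrs:
            S[i][j] = W[i][j] / total
    return S

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fuse(views, K=3, mu=0.5, iters=20):
    """Message-passing fusion of the per-view networks (needs >= 2 views)."""
    Ws = [affinity(v, K, mu) for v in views]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, K) for W in Ws]
    m, n = len(views), len(views[0])
    for _ in range(iters):
        updated = []
        for v in range(m):
            # Average of the other views' current status matrices.
            avg = [[sum(Ps[u][i][j] for u in range(m) if u != v) / (m - 1)
                    for j in range(n)] for i in range(n)]
            S_t = [list(col) for col in zip(*Ss[v])]
            updated.append(matmul(matmul(Ss[v], avg), S_t))
        Ps = updated
    F = [[sum(P[i][j] for P in Ps) / m for j in range(n)] for i in range(n)]
    return [[(F[i][j] + F[j][i]) / 2.0 for j in range(n)] for i in range(n)]

# Toy usage: two views of six samples forming two well-separated groups.
view1 = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
view2 = [[1.0, 0.0], [1.1, 0.1], [0.9, 0.0], [-1.0, 3.0], [-1.1, 3.1], [-0.9, 2.9]]
fused = fuse([view1, view2], K=2)
```

In the toy example, entries of `fused` within a group are much larger than entries between groups, which is the property spectral clustering then exploits.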

Simulation study

We carried out a simulation study to demonstrate the performance of the Joint-SNF method by comparing it with the original SNF method. The simulation design follows three principles: (i) each data type has an independent clustering structure, as well as parts that overlap with other omics data types; (ii) the overall clustering structure can be obtained only by integrating information from all omics data types; and (iii) all data types are contaminated with Gaussian noise.

Simulation settings

The generation of simulated datasets is similar to those reported elsewhere [26], [27]. Here, we considered 200 samples including three types of omics data with 1000 features each. These 200 samples were pre-defined as four subtypes, each with 50 samples. To equip the simulated data matrix with a preset clustering structure, three types of omics data matrices were constructed by setting $X_i = M_i + \varepsilon_i$, where $\varepsilon_i$ represents Gaussian random noise with variance $\sigma^2$; $i$ is the data type index; and $M_i$ is the mean expression level of the features in data type $i$. Within each data type, the mean level differs across three groups of samples: samples 1–50, 51–150, and 151–200 in $X_1$; samples 1–50/101–150, 51–100, and 151–200 in $X_2$; and samples 1–100, 101–150, and 151–200 in $X_3$. Combined across the three data types, these mean patterns represent the four subtypes among the samples. We also varied the noise level to make the clustering more challenging by generating three datasets named SimData1 ($\sigma^2 = 8$), SimData2 ($\sigma^2 = 12$) and SimData3 ($\sigma^2 = 16$). A higher variance (hence a higher noise level) is expected to make it more difficult to separate the four clusters. To evaluate the performance of each method at different proportions of signal features, three signal levels (low, moderate and high, i.e., 5%, 10% and 15% signal features) were considered for each simulated dataset. Each simulation scenario was repeated 1000 times.
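The block layout above can be made concrete with a short Python sketch. It shows how three per-type groupings combine into four subtypes of 50 samples each. The group-to-mean mapping `means=(-1.0, 0.0, 1.0)`, the choice to put the signal in the first features, and the function names are assumptions made only for illustration; the group mean values are not recoverable from the text here.

```python
import random
from collections import Counter

def group_ids(sample):
    """Group index of a 1-based sample in each of the three data types."""
    g1 = 0 if sample <= 50 else 1 if sample <= 150 else 2
    g2 = 1 if 51 <= sample <= 100 else 2 if sample >= 151 else 0
    g3 = 0 if sample <= 100 else 1 if sample <= 150 else 2
    return (g1, g2, g3)

def simulate(n_features=1000, signal_frac=0.10, sigma2=8.0,
             means=(-1.0, 0.0, 1.0), seed=0):
    """Return three features-by-samples matrices X_i = M_i + noise.

    Only the first signal_frac of the features carry group means
    (hypothetical values in `means`); the rest are pure noise.
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    n_signal = int(round(n_features * signal_frac))
    data = []
    for t in range(3):
        X = []
        for f in range(n_features):
            row = []
            for s in range(1, 201):
                mean = means[group_ids(s)[t]] if f < n_signal else 0.0
                row.append(mean + rng.gauss(0.0, sd))
            X.append(row)
        data.append(X)
    return data

# The combination of group memberships across data types defines the subtype.
subtype_sizes = Counter(group_ids(s) for s in range(1, 201))
small = simulate(n_features=20, signal_frac=0.10, sigma2=8.0)
```

No single data type separates all four subtypes (each has only three groups), which is why the full structure is recoverable only after integration.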

Simulation results

The normalized mutual information (NMI) was used as the criterion to evaluate performance. The larger the NMI value, the closer the agreement between the estimated clustering structure and the true labels. Table 1 shows the NMI values averaged over the 1000 simulation runs, together with the standard errors. The method of using SNF to integrate both joint and individual structures is referred to as JIVE-SNF. Additionally, we made comparisons with other popular multi-omics integrative clustering methods, namely Cancer Integration via Multikernel Learning (CIMLR) [28] and integrative non-negative matrix factorization (IntNMF) [29]. Overall, the Joint-SNF method shows superior performance over the other methods in the different simulation scenarios in terms of NMI. In particular, when both the noise level and the percentage of signal features were high, the NMI values of Joint-SNF, JIVE-SNF, SNF, IntNMF and CIMLR were 0.650, 0.339, 0.325, 0.328 and 0.381, respectively, showing a clear advantage of Joint-SNF over the other methods in recovering the true clustering structure. As expected, the NMI obtained by most methods (except JIVE-SNF) increases with the proportion of signal features at the same noise level.
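For readers unfamiliar with the criterion, the following Python sketch computes NMI between two labelings. The normalization by the geometric mean of the two entropies is an assumption; the text does not state which NMI variant was used, and other variants divide by the arithmetic mean or the maximum of the entropies.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information, MI / sqrt(H(a) * H(b))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a single-cluster labeling carries no information
    return mi / math.sqrt(h_a * h_b)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # labels swapped, same split
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])        # statistically independent
```

NMI depends only on the partition, not on the label names, so a clustering that matches the truth up to relabeling scores 1.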

Table 1

Averaged NMI on the simulated datasets, with standard errors given in parentheses.

Method	SimData1 (σ² = 8)			SimData2 (σ² = 12)			SimData3 (σ² = 16)
	Low	Moderate	High	Low	Moderate	High	Low	Moderate	High
Joint-SNF	0.597 (0.059)	0.660 (0.056)	0.670 (0.059)	0.484 (0.053)	0.641 (0.058)	0.656 (0.056)	0.372 (0.069)	0.598 (0.057)	0.650 (0.057)
JIVE-SNF	0.364 (0.094)	0.346 (0.076)	0.360 (0.067)	0.307 (0.087)	0.346 (0.093)	0.339 (0.072)	0.235 (0.080)	0.358 (0.098)	0.339 (0.080)
SNF	0.265 (0.032)	0.356 (0.033)	0.466 (0.054)	0.196 (0.034)	0.308 (0.031)	0.362 (0.034)	0.131 (0.035)	0.277 (0.031)	0.325 (0.031)
IntNMF	0.294 (0.031)	0.351 (0.036)	0.414 (0.049)	0.251 (0.030)	0.310 (0.028)	0.339 (0.023)	0.205 (0.039)	0.285 (0.037)	0.328 (0.031)
CIMLR	0.293 (0.042)	0.449 (0.058)	0.668 (0.052)	0.170 (0.036)	0.353 (0.044)	0.452 (0.062)	0.091 (0.043)	0.303 (0.042)	0.381 (0.038)


Real data applications

In this study, we used the data of 86 LGG patients from the CGGA database, aged from 17 to 65 years, with an average age of 38.6 years. Their baseline characteristics are presented in Table 2. A total of 52 patients (60.5%) had histopathologically confirmed grade II tumors and 34 patients (39.5%) had histopathologically confirmed grade III tumors. In addition, 46.5% of the patients were female and 53.5% were male. The majority of tumors were primary; only 5 were recurrent. By the last follow-up, 44 patients had survived and 42 patients had died. The survival time ranged from 90 to 5159 days.

Table 2

Baseline characteristics of 86 LGG patients.

Item	Classification	n (%)/mean ± SD
Age, years		38.56 ± 11.60
Gender	Female	40(46.5)
Gender	Male	46(53.5)
WHO grade	II	52(60.5)
WHO grade	III	34(39.5)
Sample type	Primary	81(94.2)
Sample type	Recurrent	5(5.8)
Survival outcome	Dead	42(48.8)
Survival outcome	Alive	44(51.2)
IDH mutation status	Mutant	59(68.6)
IDH mutation status	Wildtype	26(30.2)
IDH mutation status	NA	1(1.2)

We applied Joint-SNF to a total of 86 Chinese LGG patients using three data types: mRNA expression (19,416 mRNAs), miRNA expression (827 miRNAs) and DNA methylation (14,476 genes). Fig. 1 shows the flowchart of the LGG subtyping analysis using the Joint-SNF method. Specifically, the first step is to extract the joint structures among the mRNA, miRNA and DNA methylation data with a low-rank approximation. These structures are then fused with the SNF algorithm into a single network that boosts strong similarities and weakens spurious associations between samples. Finally, the molecular subtypes of LGG patients are obtained by spectral clustering of the fused network.

Fig. 1

Schematic representation of the Joint-SNF method used for LGG subtyping.

Subtyping of LGG using Joint-SNF

Applying Joint-SNF, we obtained three subtypes. We further conducted Kaplan–Meier survival analysis with a log-rank test to assess whether survival differed significantly among the subtypes identified by Joint-SNF. The Kaplan–Meier curves for the subtypes obtained by Joint-SNF and SNF are shown in Fig. 2. Compared with the result of SNF, the survival curves of the subtypes obtained by Joint-SNF do not overlap, revealing a clear difference. Combining the p-value result with the previous report [1], we divided Chinese LGG patients into three subtypes. Survival differs significantly among the subtypes ($\chi^2$ = 32.8, P = 7.48E-08).
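As a quick consistency check on the reported test result, the sketch below converts the log-rank statistic into a p-value. It assumes the statistic is referred to a chi-square distribution with 2 degrees of freedom (number of groups minus one), for which the upper-tail probability has the closed form exp(-x/2). The function name is hypothetical.

```python
import math

def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

p_value = chi2_upper_tail_df2(32.8)
```

The result is about 7.5E-08, which agrees with the reported 7.48E-08 up to rounding of the test statistic.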

Fig. 2

Kaplan-Meier curves showing overall survival for the three subtypes of LGG obtained by Joint-SNF (A) and SNF (B).

We further explored the prognostic value of the subtypes identified by Joint-SNF in LGG patients. Fig. 2A shows the overall survival of the different subtypes. The 26 patients (30.2%) in Cluster 1 had a 5-year mortality rate of 80.8%, the 28 patients (32.6%) in Cluster 2 had a 5-year mortality rate of 32.1%, and the 32 patients (37.2%) in Cluster 3 had a 5-year mortality rate of 34.4%. In addition, clinical characteristics differ among the clusters. Table 3 shows that, compared with the other two clusters, patients in Cluster 1, which has the worst prognosis, tend to be older, and most of them have histopathologically confirmed grade III tumors.
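The text does not state how the 5-year mortality rates were computed. One common choice when follow-up is censored is one minus the Kaplan–Meier survival estimate at 5 years, which need not equal the crude proportion of deaths in a cluster. The following Python sketch illustrates that calculation on made-up toy data; the function names and the 1825-day horizon are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve as [(time, survival)] at each distinct death time.

    events: 1 = death observed, 0 = censored at that time.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival, curve = 1.0, []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        tied = [e for tt, e in pairs if tt == t]
        deaths = sum(tied)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)   # deaths and censored subjects leave the risk set
        i += len(tied)
    return curve

def mortality_at(curve, horizon):
    """1 - S(horizon), with S taken as the last step at or before horizon."""
    survival = 1.0
    for t, s in curve:
        if t <= horizon:
            survival = s
    return 1.0 - survival

# Toy data (days): three deaths and two censored patients.
toy_curve = kaplan_meier([100, 200, 300, 400, 2000], [1, 0, 1, 1, 0])
toy_mortality = mortality_at(toy_curve, horizon=1825)
```

In the toy data, three of five patients die (60%), yet the estimated 5-year mortality is higher because the censored patient leaves the risk set early.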

Table 3

Clinical and pathological characteristics of different subtypes.

Characteristic	Cluster 1 (n = 26)	Cluster 2 (n = 28)	Cluster 3 (n = 32)
Age, years	42.65 ± 14.61	37.61 ± 8.75	36.06 ± 10.42
Female, n (%)	12(46.1)	14(50.0)	14(43.8)
WHO grade, n (%)
Grade II	1(3.8)	26(92.9)	25(78.1)
Grade III	25(96.2)	2(7.1)	7(21.9)
Death event, n (%)	20 (76.9)	6 (21.4)	16 (50.0)


Comparison of Joint-SNF with SNF in subtyping

We compared the subtyping results of Joint-SNF and SNF on LGG to evaluate the differences among the identified subtypes. The p-value result showed that Joint-SNF performed better in identifying clusters significantly associated with patient survival for a fixed number of clusters (see Table 4).

Table 4

Comparison of log-rank test p-value of Joint-SNF and SNF across different numbers of subtypes.

Method	p-value under different numbers of clusters
	3	4	5
Joint-SNF	7.48E-08	6.28E-07	3.10E-07
SNF	1.17E-07	6.37E-07	4.29E-04

Considering the small sample size of the LGG dataset, we further conducted a stability analysis to check the robustness of the subtyping performance of Joint-SNF and SNF, following [30], [31]. Specifically, we randomly sampled 75% of the LGG patients, performed subtyping using Joint-SNF and SNF assuming different numbers of clusters (3, 4 and 5), and repeated this process 20 times. For each sample split, we conducted a log-rank test for the difference between the survival curves under the assumed number of clusters. The distribution of the log-rank test p-values obtained by the two methods is displayed in Fig. 3. Overall, Joint-SNF performs better than SNF, though the difference is subtle when the number of clusters is 3. The mean p-values over the 20 repetitions are summarized in Table 5. The results show that the mean performance of Joint-SNF is better than that of SNF, in the sense that the survival curves obtained by Joint-SNF are better differentiated. This stability analysis supports the robustness of Joint-SNF in subtyping the 86 LGG patients.

Fig. 3

Boxplots of the -log10(p-value) obtained with the log-rank test for the difference of the survival curves assuming different numbers of clusters using Joint-SNF and SNF over 20 random sample splits.

Table 5

The mean p-value of the log-rank test over 20 random sample splits.

Method	mean p-value of the log-rank test
	3	4	5
Joint-SNF	8.61E-05	1.83E-04	6.71E-04
SNF	1.68E-04	1.64E-03	1.11E-02


Association of prognosis with the identified molecular subtypes

Controlling for clinicopathological variables such as age, gender and grade, we performed Cox regression analysis to assess the association between the three subtypes and LGG survival outcomes. As shown in Table 6, the risk of death for patients in Cluster 1 was 5.06 times that of patients in Cluster 2.

Table 6

Cox regression results of 86 LGG patients.

Item	Coefficient (SE)	Wald Z	P	HR (95% CI)
Subtypes
Cluster1*	1.622(0.626)	2.591	0.009	5.062(1.484,17.260)
Cluster3	0.885(0.488)	1.814	0.070	2.423(0.931,6.306)
Age	0.347(0.325)	1.067	0.286	1.415(0.747,2.679)
Gender	0.020(0.320)	0.064	0.949	1.020(0.545,1.911)
WHO grade	0.603(0.487)	1.240	0.215	1.829(0.704,4.748)

Note: *Statistically significant at the 0.05 significance level. Cluster 2 was used as the reference group for subtype comparison. For age, patients were divided into two groups using 36 years as the cutoff value.
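The columns of Table 6 are linked by standard formulas: the hazard ratio is the exponential of the coefficient, the Wald statistic is the coefficient divided by its standard error, and a Wald 95% confidence interval is obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The short Python sketch below reproduces the Cluster 1 row from its coefficient and standard error; the function name is hypothetical and small discrepancies reflect rounding of the published values.

```python
import math

def hazard_ratio(coef, se, z=1.96):
    """Return (HR, lower CI, upper CI, Wald Z) for a Cox coefficient."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se),
            coef / se)

# Cluster 1 vs. Cluster 2 (reference), values from Table 6.
hr, ci_low, ci_high, wald_z = hazard_ratio(1.622, 0.626)
```

A hazard ratio of about 5.06 means the hazard in Cluster 1 is roughly five times that in the reference cluster at any given time, under the proportional hazards assumption.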


Biological implications between the identified molecular subtypes

To elucidate how the molecular subtypes differ biologically, we performed pathway activity analysis using PROGENy. The Kruskal–Wallis test was used to identify biological pathways whose activities differ between subtypes, with the significance threshold set at p < 0.05. As shown in Fig. 4, five pathway activities differed significantly among the three clusters (p < 0.05), with Cluster 1 showing the highest activity in the EGFR, VEGF, MAPK and p53 pathways and the lowest Androgen activity. Various signaling pathways are linked to the pathogenesis of different cancers and are considered potential hallmarks for targeted cancer therapy. Inhibition of certain disease-related signaling pathways may be a promising strategy in cancer prevention or treatment. Thus, inhibiting EGFR, VEGF, MAPK and p53 pathway activities might improve the prognosis of Cluster 1 patients.

Fig. 4

Boxplots of the pathway activity for 5 pathways in different subtypes.

Co-expression network construction and core module identification

We carried out weighted gene co-expression network analysis (WGCNA) to identify gene modules associated with the prognosis of LGG patients, focusing on the mRNA expression data. The top 5000 genes (ranked by median absolute deviation) were selected to construct the mRNA co-expression network using the R package WGCNA [32]. Briefly, the adjacency matrix was converted into a topological overlap matrix (TOM) with the soft-thresholding power β set to 6 (R = 0.86). Then, we used the dynamic tree cut algorithm to identify gene modules and merged related modules using a height cutoff of 0.25. Finally, module eigengenes, which summarize the expression of each module, were associated with clinical traits to select the core modules most likely to be correlated with patient prognosis for subsequent analyses. Fifteen co-expression modules were identified (see Fig. 5A), not including the grey module. A heat map of the module-trait relationships was used to assess the relationship of each co-expression module with the LGG subtype traits (Cluster 1, Cluster 2, Cluster 3) and other clinical features (WHO grade, gender, age, overall survival). As shown in Fig. 5B, the yellow module was strongly correlated with Cluster 1 (r = 0.74, P = 7E-16) and overall survival (r = -0.52, P = 3E-07). Given that the study goal is to find new therapeutic targets and prolong the survival time of patients with extremely poor prognosis, we selected the yellow module for subsequent analysis.
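Two of the steps above, ranking genes by median absolute deviation and soft-thresholding the correlations with β = 6, can be sketched in a few lines of Python. This is an illustration only and not the WGCNA package itself: the unsigned adjacency |cor|^β is assumed (the text does not say whether a signed or unsigned network was used), the helper names are hypothetical, and the TOM and module detection steps are not shown.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])

def top_by_mad(expr, k):
    """expr: dict gene -> expression values. Return the k most variable genes."""
    return sorted(expr, key=lambda g: mad(expr[g]), reverse=True)[:k]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_adjacency(expr, genes, beta=6):
    """Unsigned soft-threshold adjacency a_ij = |cor(x_i, x_j)| ** beta."""
    return {(g, h): abs(pearson(expr[g], expr[h])) ** beta
            for g in genes for h in genes}

# Toy expression values for three hypothetical genes across four samples.
toy_expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [5, 5, 5.1, 5]}
selected = top_by_mad(toy_expr, 2)
adjacency = soft_adjacency(toy_expr, selected)
```

Raising correlations to the power β shrinks weak correlations toward zero much faster than strong ones, which is what makes the resulting network approximately scale-free.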

Fig. 5

(A) Dendrogram representing hierarchical clustering of identified co-expressed modules. (B) Heatmap visualizing the correlation between Eigengene of modules and clinical traits of LGG. Each row represents a color module, and each column represents a clinical feature. Each cell is filled with the correlation and p-value.

Functional annotation and enrichment analysis of the core module

To identify the potential biological processes and pathways for the 478 genes in the yellow module, Gene Ontology [33] (GO) and Kyoto Encyclopedia of Genes and Genomes [34] (KEGG) analyses were carried out using the R package clusterProfiler [35] to obtain the relevant biological function categories and signaling pathways. The cutoff criteria were set to p-value < 0.05 and q-value < 0.01. As presented in Fig. 6A, these genes were mainly enriched for the following GO terms: nuclear division, organelle fission, chromosome segregation and negative regulation of cell cycle process. In addition, KEGG analysis revealed that these genes were enriched in 11 pathways, including cell cycle, p53 signaling pathway, small cell lung cancer and oocyte meiosis (Fig. 6B).

Fig. 6

GO biological process enrichment analysis (A) and KEGG enrichment analysis (B) for 478 genes in yellow module. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Hub gene identification

Candidate genes were defined as genes correlated with the module eigengene (gene module membership correlation > 0.9) and with clinical traits (gene trait significance correlation > 0.3). Using these criteria, we screened 28 candidate genes from the yellow module. The expression heatmap of the candidate genes in the different subtypes is presented in Fig. 7A. The expression levels of the candidate genes vary between subtypes. More specifically, the expression of the candidate genes was higher in Cluster 1, which has the worst prognosis.

Fig. 7

(A) Heatmap reflecting the expression level of candidate genes in the three subtypes. Each row corresponds to a gene feature and each column corresponds to a patient. Red and blue colors indicate relatively high and low gene expression. The three colored bars on the top indicate subtype clusters 1, 2 and 3 from left to right. (B) Network diagram of the interactions between hub genes (red) and candidate genes (blue). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To identify hub genes, the CytoHubba plugin in the Cytoscape software was employed to compute the Maximal Clique Centrality (MCC) score of the candidate genes. MCC is considered a powerful indicator for identifying central nodes in co-expression networks [36]. The top 10 highly connected genes were used as the hub genes for further analysis, namely NCAPH, CENPM, CDC45L, TK1, FAM64A, UBE2C, BIRC5, KIAA0101, OIP5 and HAUS8. The interaction between hub genes and candidate genes was visualized using the Cytoscape software. As shown in Fig. 7B, candidate genes are less connected to other candidate genes and more connected to the hub genes, indicating the importance of the hub genes in regulating other genes.

Evaluation of the prognostic value of hub genes

To verify the prognostic value of the 10 hub genes, all patients were divided into two groups for each hub gene based on its median expression value, with patients at or above the median assigned to the high-level group and patients below the median assigned to the low-level group. We performed survival analysis to evaluate the statistical significance of the difference in survival outcomes between the two groups using the R packages survival and survminer. Kaplan–Meier analysis showed that 9 of the 10 hub genes (FAM64A, OIP5, NCAPH, KIAA0101, UBE2C, TK1, CDC45L, BIRC5, CENPM) were significantly correlated with prognosis. For all 9 of these hub genes, the higher the expression of the gene, the poorer the prognosis (see Fig. 8).
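The grouping rule described above is simple enough to state in code. The sketch below is a minimal Python illustration with made-up patient identifiers and expression values; the function name is hypothetical. Patients whose expression equals the median fall into the high-level group, as in the text.

```python
from statistics import median

def split_by_median(expression):
    """expression: dict patient -> expression of one gene.

    Returns dict patient -> 'high' (>= median) or 'low' (< median).
    """
    cutoff = median(expression.values())
    return {patient: ("high" if value >= cutoff else "low")
            for patient, value in expression.items()}

# Toy values for one hub gene in five hypothetical patients.
groups = split_by_median({"p1": 2.0, "p2": 7.5, "p3": 4.1, "p4": 4.1, "p5": 9.3})
```

The two resulting groups would then be compared with a Kaplan–Meier analysis and log-rank test, one gene at a time.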

Fig. 8

Plots showing prognostic survival curves of the 9 hub genes sorted in ascending order by p-value.

Discussion

In this study, we proposed a new method called Joint-SNF to integrate multi-omics data to identify molecular subtypes. The Joint-SNF method is a disease subtyping method making use of the correlation and complementary information between different omics data types. The fused network obtained by the Joint-SNF method enhances strong similarities and weakens some spurious associations between samples while reducing the noise. This method separates signals common to all data types from individual ones and avoids the negative impact of irrelevant omics data on cancer subtyping. By extracting the joint structure between omics data types, the original data can be effectively reduced in dimensionality without losing key information. Both simulation studies and LGG subtyping application have demonstrated that Joint-SNF achieves efficient and accurate subtyping compared to the original SNF method based on original features. Three LGG subtypes (Cluster 1, 2 and 3) identified by Joint-SNF differed significantly in survival outcome. We observed that Cluster 1 with 26 subjects had the worst survival rate with a high 5-year mortality rate of 80.8%, compared to the other two clusters with a 5-year mortality rate of 32.1% and 34.4%. Furthermore, after adjusting for the effects of covariates, patients in Cluster 1 were 5.062 times higher in mortality compared to patients in Cluster 2. Focusing on subtypes, we further investigated some unique manifestations of different subtypes through bioinformatics analysis and explored their clinical value, especially whether they could help improve the survival time of patients with poor prognosis. We obtained gene modules that affect the prognosis of LGG patients through WGCNA analysis, of which the yellow module had the highest correlation with prognosis. This indicates that the critical genes in the yellow module may serve as potential biomarkers affecting the progression of Chinese LGG patients. 
We further analyzed the 478 co-expressed genes identified in the yellow module. GO functional annotation analysis showed that these genes were mainly enriched in nuclear division, organelle division, chromosome separation and negative regulation of cell cycle process. These biological processes are involved in regulating the growth and proliferation of cancer cells and are associated with the recurrence of LGG [37], [38]. KEGG pathway enrichment analysis showed that the genes were associated with several cancer-related pathways, such as the p53 signaling pathway, the small cell lung cancer pathway and the cell cycle pathway. The cell cycle and p53 signaling pathways have been reported to play a crucial role in the development of LGG [39].
To identify critical genes in the yellow module, we first obtained candidate genes based on the association among genes and the association between the gene set and the clinical subtypes. The expression levels of the candidate genes differed between groups and were high in Cluster 1, which points to the importance of these genes for the poor prognosis of Cluster 1. We then screened 10 hub genes according to the MCC score and investigated their prognostic value. Survival analysis of LGG patients with high and low expression of these genes showed that 9 of the 10 hub genes were associated with prognosis. Four of these 9 genes have been reported to be related to gliomas. UBE2C, a member of the E2 ubiquitin-conjugating enzyme family, plays a key role in cell cycle control, cell signal transduction and cell differentiation. A previous study has shown that UBE2C is overexpressed in LGG and that its overexpression can lead to poor prognosis [40]. This is consistent with our results, in which high expression of UBE2C is associated with poor prognosis.
BIRC5, also known as survivin, is an immune-related gene belonging to the apoptotic gene family. It has been reported that BIRC5 may be a potential biomarker and therapeutic target for LGG [41]. Overall, UBE2C and BIRC5 might be promising candidate biomarkers for improving the prognosis of Chinese LGG patients, although further biological validation is needed. KIAA0101 encodes a conserved protein that plays an essential role in the regulation of various biological processes [42]. Recently, Liu et al. [43] reported that KIAA0101 is overexpressed in gliomas and that its expression level is positively correlated with glioma grade. Opa Interacting Protein 5 (OIP5) is a cancer-testis specific gene that participates in various tumor biological processes [44]. Recent research has shown that it is upregulated in glioblastoma patients and correlated with poor prognosis [45].
Non-SMC condensin I complex subunit H (NCAPH) encodes a member of the Barr gene family and a regulatory subunit of the condensin complex. NCAPH has also been reported to promote tumor formation, proliferation and metastasis [46], [47]. Centromere protein M (CENPM) is a component of the CENPA-NAC (nucleosome-associated) complex, which plays a central role in the assembly of kinetochore proteins, mitotic progression, and chromosome segregation [48]. It has been reported as a novel biomarker of hepatocellular carcinoma [49], melanoma [50] and bladder cancer [51]. FAM64A (also known as RCS1 or PIMREG) performs important biological functions in various cells by accelerating the cell cycle and is abnormally expressed in many tumor tissues [52]. Recent studies have found that FAM64A is highly expressed in tumor tissues and cells of patients with lung adenocarcinoma [53] and pancreatic cancer [54]. Thymidine kinase 1 (TK1) has been found to be closely related to cancer proliferation [55].
Cell division cycle 45-like (CDC45L) has a critical role in the initiation and elongation steps of DNA replication [56], and it is regarded as a promising proliferation marker in tumor cell biology [57]. Although these five genes (NCAPH, CENPM, FAM64A, TK1 and CDC45L) have not been directly reported in glioma studies, their basic biological functions and carcinogenic properties have been elucidated. This suggests that they have the potential to affect the occurrence and development of LGG. Moreover, our results show that high expression of these genes is associated with poor prognosis. Further studies are needed to elucidate their specific mechanisms in LGG.
Our proposed Joint-SNF method provides a new strategy for the integrated analysis of multi-omics data and has been successfully applied to subtyping LGG patients. The fused network obtained by Joint-SNF enhances strong similarities and weakens spurious associations between samples while reducing noise. In addition, the method separates signals common to all data types from individual ones and effectively reduces the dimension of the original data without losing key information. Overall, our findings may provide novel insights into the subtypes of LGG patients and important clues for improving patient survival outcomes and for choosing individualized treatment.

CRediT authorship contribution statement

Lingmei Li: Formal analysis, Methodology, Software, Writing – original draft. Yifang Wei: Formal analysis. Guojing Shi: Formal analysis. Haitao Yang: Visualization. Zhi Li: Conceptualization. Ruiling Fang: Data curation. Hongyan Cao: Conceptualization, Methodology, Writing – review & editing. Yuehua Cui: Conceptualization, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Algorithm 1: JIVE decomposition
Require: a scaled matrix X
Output: lists of the joint and individual structure matrices
Details:
1. A singular value decomposition is performed for each data type. Using the singular values (Σ) and the right singular vectors (V), the reduced data set is ΣV′.
2. Set J = {J_1, …, J_k} by a rank r singular value decomposition of the scaled matrix X. Save the right singular vectors (V).
3. Set A_i = A_i ∏_{k≠i} (I − V_k V_k′) by a rank r_i singular value decomposition of X_Individual = (X − J)(I − VV′) if orthogonality is enforced between individual structures. This is the first iteration.
4. A_i is obtained by a rank r_i singular value decomposition of X_Individual = (X − J)(I − VV′) ∏_{k≠i} (I − V_k V_k′) if the orthogonality constraint is imposed between individual structures. Save the right singular vectors (V_i).
5. Repeat steps 2–4 until the Frobenius norm of the difference between the current and previous iteration in both J and A is less than some threshold.
6. Return results (J, A, and the ranks used in the decomposition).
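
The alternating structure of Algorithm 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `low_rank` and `jive_sketch` are hypothetical, the ranks are assumed to be supplied by the user, the initial dimension reduction of step 1 and the optional orthogonality between individual structures are left out, and the joint fit in later iterations is assumed to be applied to X − A, as is usual for JIVE. Blocks are assumed to have features in rows and the same samples in columns.

```python
import numpy as np


def low_rank(M, rank):
    """Return the best rank-`rank` approximation of M and its right singular vectors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return approx, Vt[:rank].T


def jive_sketch(blocks, joint_rank, indiv_ranks, tol=1e-6, max_iter=200):
    """Alternate joint and individual low-rank fits (hypothetical helper, fixed ranks)."""
    sizes = [b.shape[0] for b in blocks]
    splits = np.cumsum(sizes)[:-1]          # row indices where one block ends
    X = np.vstack(blocks).astype(float)     # stacked data, samples in columns
    n = X.shape[1]
    J = np.zeros_like(X)
    A = np.zeros_like(X)
    for _ in range(max_iter):
        # Joint structure: rank-r fit to the data minus the individual structure
        # (assumption: X - A, which equals X in the first iteration).
        J_new, V = low_rank(X - A, joint_rank)
        # Projector onto the orthogonal complement of the joint row space.
        proj = np.eye(n) - V @ V.T
        # Individual structure: per-block rank-r_i fit, kept orthogonal to J.
        A_new = np.vstack([
            low_rank((Xi - Ji) @ proj, ri)[0]
            for Xi, Ji, ri in zip(np.split(X, splits), np.split(J_new, splits), indiv_ranks)
        ])
        # Stop when both J and A change by less than the threshold (Frobenius norm).
        change = max(np.linalg.norm(J_new - J), np.linalg.norm(A_new - A))
        J, A = J_new, A_new
        if change < tol:
            break
    return np.split(J, splits), np.split(A, splits)
```

Calling `jive_sketch([X1, X2], joint_rank=2, indiv_ranks=[1, 1])` returns one joint and one individual matrix per block. Because each individual fit is taken after projecting away the joint row space, the returned joint and individual parts are orthogonal by construction.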

Algorithm 2: Similarity network fusion
Input: a similarity matrix W with entries W_{i,j}
Output: the final fused network P_final
1. if j ≠ i then
     normalize the weight matrix: P_{i,j} = W_{i,j} / (2 ∑_{k≠i} W_{i,k})
   else
     P_{i,j} = 1/2
   end if
   if j ∈ N_i then
     S_{i,j} = W_{i,j} / ∑_{k∈N_i} W_{i,k}, where N_i is the set of neighbors of patient x_i
   else
     S_{i,j} = 0
   end if
2. Obtain P_v and S_v for the vth joint structure (v = 1, 2, …, r).
3. Iteratively update the similarity network: P_v = S_v × (∑_{k≠v} P_k / (r − 1)) × S_v^T
4. P_final = (∑_{v=1}^{r} P_v) / r
return P_final
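
The steps of Algorithm 2 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code of the authors' R package: the function names (`full_kernel`, `local_kernel`, `snf_fuse`) and the defaults for `n_neighbors` and `n_iter` are hypothetical, the neighbor set N_i is assumed to exclude the patient itself, and the re-normalization and symmetrization steps that SNF implementations commonly apply between iterations are omitted because Algorithm 2 does not list them.

```python
import numpy as np


def full_kernel(W):
    """P: off-diagonal weights divided by twice their row sum, with 1/2 on the diagonal."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))                      # drop self-similarities
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, n_neighbors):
    """S: keep each row's n_neighbors largest off-diagonal weights and normalize them."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))
    S = np.zeros_like(off)
    for i, row in enumerate(off):
        nbrs = np.argsort(row)[-n_neighbors:]          # indices of the most similar samples
        S[i, nbrs] = row[nbrs] / row[nbrs].sum()
    return S


def snf_fuse(W_list, n_neighbors=3, n_iter=20):
    """Fuse r >= 2 similarity matrices by the cross-network update of step 3."""
    r = len(W_list)
    P = [full_kernel(W) for W in W_list]
    S = [local_kernel(W, n_neighbors) for W in W_list]
    for _ in range(n_iter):
        # Every view is updated from the previous iteration's P of the other views.
        P = [
            S[v] @ (sum(P[k] for k in range(r) if k != v) / (r - 1)) @ S[v].T
            for v in range(r)
        ]
    return sum(P) / r                                  # step 4: average over views
```

In the Joint-SNF setting, `W_list` would hold one similarity matrix per joint structure. The fused matrix returned by `snf_fuse` could then be passed to a spectral clustering routine to obtain patient subtypes.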

54 in total

1. clusterProfiler: an R package for comparing biological themes among gene clusters.

Authors: Guangchuang Yu; Li-Gen Wang; Yanyan Han; Qing-Yu He
Journal: OMICS Date: 2012-03-28

2. Comprehensive analysis of the ICEN (Interphase Centromere Complex) components enriched in the CENP-A chromatin of human cells.

Authors: Hiroshi Izuta; Masashi Ikeno; Nobutaka Suzuki; Takeshi Tomonaga; Naohito Nozaki; Chikashi Obuse; Yasutomo Kisu; Naoki Goshima; Fumio Nomura; Nobuo Nomura; Kinya Yoda
Journal: Genes Cells Date: 2006-06 Impact factor: 1.891

3. Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas.

Authors: Daniel J Brat; Roel G W Verhaak; Kenneth D Aldape; W K Alfred Yung; Sofie R Salama; Lee A D Cooper; Esther Rheinbay; C Ryan Miller; Mark Vitucci; Olena Morozova; A Gordon Robertson; Houtan Noushmehr; Peter W Laird; Andrew D Cherniack; Rehan Akbani; Jason T Huse; Giovanni Ciriello; Laila M Poisson; Jill S Barnholtz-Sloan; Mitchel S Berger; Cameron Brennan; Rivka R Colen; Howard Colman; Adam E Flanders; Caterina Giannini; Mia Grifford; Antonio Iavarone; Rajan Jain; Isaac Joseph; Jaegil Kim; Katayoon Kasaian; Tom Mikkelsen; Bradley A Murray; Brian Patrick O'Neill; Lior Pachter; Donald W Parsons; Carrie Sougnez; Erik P Sulman; Scott R Vandenberg; Erwin G Van Meir; Andreas von Deimling; Hailei Zhang; Daniel Crain; Kevin Lau; David Mallery; Scott Morris; Joseph Paulauskis; Robert Penny; Troy Shelton; Mark Sherman; Peggy Yena; Aaron Black; Jay Bowen; Katie Dicostanzo; Julie Gastier-Foster; Kristen M Leraas; Tara M Lichtenberg; Christopher R Pierson; Nilsa C Ramirez; Cynthia Taylor; Stephanie Weaver; Lisa Wise; Erik Zmuda; Tanja Davidsen; John A Demchok; Greg Eley; Martin L Ferguson; Carolyn M Hutter; Kenna R Mills Shaw; Bradley A Ozenberger; Margi Sheth; Heidi J Sofia; Roy Tarnuzzer; Zhining Wang; Liming Yang; Jean Claude Zenklusen; Brenda Ayala; Julien Baboud; Sudha Chudamani; Mark A Jensen; Jia Liu; Todd Pihl; Rohini Raman; Yunhu Wan; Ye Wu; Adrian Ally; J Todd Auman; Miruna Balasundaram; Saianand Balu; Stephen B Baylin; Rameen Beroukhim; Moiz S Bootwalla; Reanne Bowlby; Christopher A Bristow; Denise Brooks; Yaron Butterfield; Rebecca Carlsen; Scott Carter; Lynda Chin; Andy Chu; Eric Chuah; Kristian Cibulskis; Amanda Clarke; Simon G Coetzee; Noreen Dhalla; Tim Fennell; Sheila Fisher; Stacey Gabriel; Gad Getz; Richard Gibbs; Ranabir Guin; Angela Hadjipanayis; D Neil Hayes; Toshinori Hinoue; Katherine Hoadley; Robert A Holt; Alan P Hoyle; Stuart R Jefferys; Steven Jones; Corbin D Jones; Raju Kucherlapati; Phillip H Lai; Eric Lander; Semin Lee; Lee Lichtenstein; 
Yussanne Ma; Dennis T Maglinte; Harshad S Mahadeshwar; Marco A Marra; Michael Mayo; Shaowu Meng; Matthew L Meyerson; Piotr A Mieczkowski; Richard A Moore; Lisle E Mose; Andrew J Mungall; Angeliki Pantazi; Michael Parfenov; Peter J Park; Joel S Parker; Charles M Perou; Alexei Protopopov; Xiaojia Ren; Jeffrey Roach; Thaís S Sabedot; Jacqueline Schein; Steven E Schumacher; Jonathan G Seidman; Sahil Seth; Hui Shen; Janae V Simons; Payal Sipahimalani; Matthew G Soloway; Xingzhi Song; Huandong Sun; Barbara Tabak; Angela Tam; Donghui Tan; Jiabin Tang; Nina Thiessen; Timothy Triche; David J Van Den Berg; Umadevi Veluvolu; Scot Waring; Daniel J Weisenberger; Matthew D Wilkerson; Tina Wong; Junyuan Wu; Liu Xi; Andrew W Xu; Lixing Yang; Travis I Zack; Jianhua Zhang; B Arman Aksoy; Harindra Arachchi; Chris Benz; Brady Bernard; Daniel Carlin; Juok Cho; Daniel DiCara; Scott Frazer; Gregory N Fuller; JianJiong Gao; Nils Gehlenborg; David Haussler; David I Heiman; Lisa Iype; Anders Jacobsen; Zhenlin Ju; Sol Katzman; Hoon Kim; Theo Knijnenburg; Richard Bailey Kreisberg; Michael S Lawrence; William Lee; Kalle Leinonen; Pei Lin; Shiyun Ling; Wenbin Liu; Yingchun Liu; Yuexin Liu; Yiling Lu; Gordon Mills; Sam Ng; Michael S Noble; Evan Paull; Arvind Rao; Sheila Reynolds; Gordon Saksena; Zack Sanborn; Chris Sander; Nikolaus Schultz; Yasin Senbabaoglu; Ronglai Shen; Ilya Shmulevich; Rileen Sinha; Josh Stuart; S Onur Sumer; Yichao Sun; Natalie Tasman; Barry S Taylor; Doug Voet; Nils Weinhold; John N Weinstein; Da Yang; Kosuke Yoshihara; Siyuan Zheng; Wei Zhang; Lihua Zou; Ty Abel; Sara Sadeghi; Mark L Cohen; Jenny Eschbacher; Eyas M Hattab; Aditya Raghunathan; Matthew J Schniederjan; Dina Aziz; Gene Barnett; Wendi Barrett; Darell D Bigner; Lori Boice; Cathy Brewer; Chiara Calatozzolo; Benito Campos; Carlos Gilberto Carlotti; Timothy A Chan; Lucia Cuppini; Erin Curley; Stefania Cuzzubbo; Karen Devine; Francesco DiMeco; Rebecca Duell; J Bradley Elder; Ashley Fehrenbach; Gaetano Finocchiaro; 
William Friedman; Jordonna Fulop; Johanna Gardner; Beth Hermes; Christel Herold-Mende; Christine Jungk; Ady Kendler; Norman L Lehman; Eric Lipp; Ouida Liu; Randy Mandt; Mary McGraw; Roger Mclendon; Christopher McPherson; Luciano Neder; Phuong Nguyen; Ardene Noss; Raffaele Nunziata; Quinn T Ostrom; Cheryl Palmer; Alessandro Perin; Bianca Pollo; Alexander Potapov; Olga Potapova; W Kimryn Rathmell; Daniil Rotin; Lisa Scarpace; Cathy Schilero; Kelly Senecal; Kristen Shimmel; Vsevolod Shurkhay; Suzanne Sifri; Rosy Singh; Andrew E Sloan; Kathy Smolenski; Susan M Staugaitis; Ruth Steele; Leigh Thorne; Daniela P C Tirapelli; Andreas Unterberg; Mahitha Vallurupalli; Yun Wang; Ronald Warnick; Felicia Williams; Yingli Wolinsky; Sue Bell; Mara Rosenberg; Chip Stewart; Franklin Huang; Jonna L Grimsby; Amie J Radenbaugh; Jianan Zhang
Journal: N Engl J Med Date: 2015-06-10 Impact factor: 91.245

4. A comparative study of multi-omics integration tools for cancer driver gene identification and tumour subtyping.

Authors: Anita Sathyanarayanan; Rohit Gupta; Erik W Thompson; Dale R Nyholt; Denis C Bauer; Shivashankar H Nagaraj
Journal: Brief Bioinform Date: 2020-12-01 Impact factor: 11.622

5. The Anticancer Effects of Garlic Extracts on Bladder Cancer Compared to Cisplatin: A Common Mechanism of Action via Centromere Protein M.

Authors: Won Tae Kim; Sung-Pil Seo; Young Joon Byun; Ho-Won Kang; Yong-June Kim; Sang-Cheol Lee; Pildu Jeong; Hye-Jin Song; Soo Young Choe; Dong-Joon Kim; Seon-Kyu Kim; Yun Sok Ha; Sung-Kwon Moon; Geun Taek Lee; Isaac Yi Kim; Seok Joong Yun; Wun-Jae Kim
Journal: Am J Chin Med Date: 2018-03-29 Impact factor: 4.667

6. Using association signal annotations to boost similarity network fusion.

Authors: Peifeng Ruan; Ya Wang; Ronglai Shen; Shuang Wang
Journal: Bioinformatics Date: 2019-10-01 Impact factor: 6.937

7. Similarity network fusion for aggregating data types on a genomic scale.

Authors: Bo Wang; Aziz M Mezlini; Feyyaz Demir; Marc Fiume; Zhuowen Tu; Michael Brudno; Benjamin Haibe-Kains; Anna Goldenberg
Journal: Nat Methods Date: 2014-01-26 Impact factor: 28.547

8. Prognostic relevance of genetic alterations in diffuse lower-grade gliomas.

Authors: Kosuke Aoki; Hideo Nakamura; Hiromichi Suzuki; Keitaro Matsuo; Keisuke Kataoka; Teppei Shimamura; Kazuya Motomura; Fumiharu Ohka; Satoshi Shiina; Takashi Yamamoto; Yasunobu Nagata; Tetsuichi Yoshizato; Masahiro Mizoguchi; Tatsuya Abe; Yasutomo Momii; Yoshihiro Muragaki; Reiko Watanabe; Ichiro Ito; Masashi Sanada; Hironori Yajima; Naoya Morita; Ichiro Takeuchi; Satoru Miyano; Toshihiko Wakabayashi; Seishi Ogawa; Atsushi Natsume
Journal: Neuro Oncol Date: 2018-01-10 Impact factor: 12.300

9. Human Cdc45 is a proliferation-associated antigen.

Authors: S Pollok; C Bauerschmidt; J Sänger; H-P Nasheuer; F Grosse
Journal: FEBS J Date: 2007-07-03 Impact factor: 5.542

10. Glioma Groups Based on 1p/19q, IDH, and TERT Promoter Mutations in Tumors.

Authors: Jeanette E Eckel-Passow; Daniel H Lachance; Annette M Molinaro; Kyle M Walsh; Paul A Decker; Hugues Sicotte; Melike Pekmezci; Terri Rice; Matt L Kosel; Ivan V Smirnov; Gobinda Sarkar; Alissa A Caron; Thomas M Kollmeyer; Corinne E Praska; Anisha R Chada; Chandralekha Halder; Helen M Hansen; Lucie S McCoy; Paige M Bracci; Roxanne Marshall; Shichun Zheng; Gerald F Reis; Alexander R Pico; Brian P O'Neill; Jan C Buckner; Caterina Giannini; Jason T Huse; Arie Perry; Tarik Tihan; Mitchell S Berger; Susan M Chang; Michael D Prados; Joseph Wiemels; John K Wiencke; Margaret R Wrensch; Robert B Jenkins
Journal: N Engl J Med Date: 2015-06-10 Impact factor: 176.079