Literature DB >> 30778371

Co-expression Network Analysis Identifies Four Hub Genes Associated With Prognosis in Soft Tissue Sarcoma.

Zhenhua Zhu¹, Zheng Jin², Yuyou Deng³, Lai Wei⁴, Xiaowei Yuan¹, Mei Zhang⁵, Dahui Sun¹.

Abstract

Background: Soft tissue sarcomas (STS) are heterogeneous tumors derived from mesenchymal cells that differentiate into soft tissues. The prognosis of patients who present with an STS is influenced by the regulation of a complex gene network.
Methods: Weighted gene co-expression network analysis (WGCNA) was performed to identify gene modules associated with STS (Samples = 156).
Results: Among the 11 modules identified, the black and blue modules were highly correlated with STS. However, using preservation analysis, the black module demonstrated low preservation, therefore the blue module was chosen as the module of interest. Furthermore, a total of 20 network hub genes were identified in the blue module, 12 of which were also hub nodes in the protein-protein interaction network of the module genes. Following additional verification, 4 of 12 genes (RRM2, BUB1B, CENPF, and KIF20A) demonstrated poorer overall survival and disease-free survival rate in the test datasets. In addition, gene set enrichment analysis (GSEA) demonstrated that samples with a high level of blue module eigengene (ME) were enriched in cell cycle and metabolism associated signaling pathways.
Conclusion: In summary, co-expression network analysis identified four hub genes associated with prognosis for STS, which may diminish the prognosis by influencing cell cycle and metabolism associated signaling pathways.

Entities: Chemical Disease Gene Species

Keywords: BUB1B; CENPF; KIF20A; RRM2; soft tissue sarcoma; weighted gene co-expression analysis

Year: 2019 PMID： 30778371 PMCID： PMC6369179 DOI： 10.3389/fgene.2019.00037

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

Introduction

Soft tissue sarcoma (STS) is a rare group of tumors that accounts for approximately 1% of adult cancers. In 2009, it was estimated that 3,300 new cases were diagnosed in Britain and 10,000 in the United States (Linch et al., 2014). There are approximately 50 STS subtypes, which differ significantly in their disease presentation, response to currently available treatments and risk of tumor progression (Casali et al., 2018). Multiple factors have been reported to be related to the progression of STS, including capillary morphogenesis gene 2 (CMG2) (Greither et al., 2017), HIF-2α protein (Nakazawa et al., 2016), epidermal growth factor receptor (EGFR) protein (Yang et al., 2017) and microRNAs (Smolle et al., 2017). However, no molecular biomarkers have been defined for predicting the prognosis of the disease in clinical. Therefore, a better understanding of the molecular pathogenesis is required. To date, microarray-based expression data have been used to identify genes related to tumor progression and prognosis. Takahashi et al. (2014) identified 25 survival-associated genes using a knowledge-based filtering and multiple testing approach. Beck et al (2010) has reviewed the manner in which gene expression profiling has been used to understand sarcoma pathobiology and identify clinically useful biomarkers. However, most studies have focused on screening genes that have different patterns of expression with explanations gained from gene ontology (GO) analysis. Such approaches, however, have failed to address the large number of interconnections between genes, because genes with similar expression profiles are most likely to function closely together. Therefore, weighted gene co-expression network analysis (WGCNA) clusters genes co-expressed in a network, based on similarities in expression profiles among samples and in clinical traits, to define sub-network regions (known as modules) (Langfelder and Horvath, 2008). In this study, we utilized WGCNA to identify the most relevant module in STS. Key genes in the module were identified and validated using survival and protein-protein interaction (PPI) analyses. These key genes may shed new light on the biological mechanisms underlying STS progression and could potentially be used as prognostic biomarkers or therapeutic targets.

Materials and Methods

Study Design and Data Collection

Study design, data preparation, preprocessing, analysis and validation are described in a flowchart (Figure 1). Core codes used to reproduce the results were provided in Supplementary Table S1. Firstly, normalized RNAseq data and associated clinical data were downloaded from the NCBI Gene Expression Omnibus (GEO). Dataset GSE21122 (Barretina et al., 2010), which was generated using an Affymetrix human genome U133A microarray (HG-U133A), was used as a training set to construct the co-expression network and identify key modules in this study. This dataset included 149 STS samples and 9 normal fat tissue samples. The STS samples contained 116 different types of liposarcoma and 34 malignant fibrous histiocytomas (MFHs). Most STSs (68.8%) were primary tumors at the time of sample procurement from patients whose mean age was 56 years. In addition, two test datasets were used to test the preservation of identified modules and survival significance of hub genes. The first one, which included RNA sequencing data and associated clinical information of 265 STS samples, were downloaded from The Cancer Genome Atlas (TCGA) database[1]. The other one, GSE21050 dataset (Chibon et al., 2010), which included RNA sequencing data and associated clinical information of 310 STS samples were downloaded from the NCBI GEO.

FIGURE 1

Flow diagram of strategy for data preparation, preprocessing and analysis used in this study.

Data Preprocessing

Firstly, we extracted training expression data from the GSE21122 MINiML file. The expression data was background corrected using the Robust Multi-array Average (RMA) algorithm and log base 2 normalized. The data were then checked to ascertain whether there was a batch effect. No apparent batch effect was observed after analysis of expression clusters, box plots and principal components analysis (PCA) (Supplementary Figure S1). In order to detect outliers for WGCNA analysis, sample network was calculated based on squared Euclidean distance. The connectivity of each sample was defined as the sum of the connectivity of that sample with all other samples. Outliers were identified after normalization of the connectivity of each sample, by use of the threshold z.k < 0.6. Generally, genes whose expression varies greatly are more biologically relevant. To reduce background noise, we selected genes that were varied expressed across samples and removed those whose expression was the same across samples. The median absolute deviation (MAD) was calculated for each gene as a robust measure of variability. Then, genes were sorted based on the MAD value and the top 3,000 ranked genes were used for the subsequent WGCNA analysis.

Co-expression Network Construction and Module Preservation Analysis

The WGCNA package (Langfelder and Horvath, 2008) was used to construct the co-expression network. The concordance of genes in the expression dataset was measured with Pearson correlation, then the Pearson correlation matrix was transformed to weighted network with the power adjacency function. The first step in this process was selection of an appropriate soft power, in which strong connections between genes are promoted and weak connections penalized, so as to transform the network into one meeting the requirements of a scale-free network. Modules were identified using the dynamic tree-cutting function with a deepSplit argument value of 2 and a minimum size cutoff of 30. To test whether the identified modules were stable in the test TCGA dataset, the downloaded fragments per million (FPKM) expression data of 265 samples were transformed to the transcripts per million (TPM). A total of 2704 common genes in the training and TCGA datasets were used for preservation analysis. The module Preservation function (nPermutations = 200) of the WGCNA package (Langfelder et al., 2011) was utilized, in which the preservation statistic Zsummary was used to quantify the preservation of gene modules between datasets.

Finding Modules of Interest and Functional Annotation

Because the module eigengene (ME) provides the most appropriate synopsis of gene expression profiles of any given module, we correlated MEs with clinical traits. In this study, clinical traits refer to whether the sample was a STS or normal fat tissue. Correlations were then calculated using linear regression model. The modules for which the eigengenes showed high correlation were chosen as the modules of interest. In an attempt to ascertain possible mechanisms of genes within a module affecting STS progression, functional enrichment analyses using the KEGG and GO databases of the hub module was performed with the “clusterProfile” package in R (Yu et al., 2012).

Identification of Hub Genes and Correlation Analysis

Hub genes are those that have a high degree of intra-module connectivity. In this study, hub genes were defined as the 20 module genes with highest connectivity in the interested module. A PPI network was constructed in order to identify hub nodes by uploading all genes in the hub module to the Search Tool for the Retrieval of Interacting Gene (STRING) database[2]. The PPI network was then imported into the Cytoscape software platform and a comprehensive analysis of the relationship between nodes was performed using the Maximal Clique Centrality (MCC) function, reported to be the most effective method of finding hub nodes in a co-expression network (Chin et al., 2014), within the “cytoHubba” application. In this way, the most cohesive genes were marked as “first stage nodes.” In the PPI network of blue module genes, the 30 most highly ranked nodes were identified as “first stage nodes.” Genes that were defined as both hub genes in the module and “first stage nodes” in the PPI network were chosen as primary hub genes.

Survival Analysis and Efficacy Evaluation

The internet tool, Gene Expression Profiling Interactive Analysis (GEPIA)[3], was used to perform overall survival and disease-free survival analyses for all hub genes. The platform utilizes all expression data and survival information of the TCGA database. Users are able to accomplish survival analysis by simply submitting a gene name and selecting a tumor type. Patients were divided into two groups (high vs. low) based on the hub gene expression level in comparison to the mean expression level of that hub gene. Furthermore, dataset GSE21050, which includes 310 STS samples in which metastasis status and survival time were provided, was used to test the significance of hub genes for metastasis survival. A Kaplan-Meier survival plot was constructed using the “survival” package in R (Li, 2003). Differential expression between STS and normal tissue in the training set was plotted as a box plot graph.

Gene Set Enrichment Analysis (GSEA)

In the training data set, 156 samples were dichotomized into two groups (High vs. Low) based on the ME value of blue module in comparison to the mean ME level of blue module of all samples. GSEA was then performed between the two groups. The 3,000 most variable genes from the WGCNA were imported for enrichment. In this way, GSEA was used to validate the results of GO and KEGG analysis of the blue module. The cut-off criterion for GSEA was FDR < 0.05.

Results

After discarding two outlier samples (GSM528297 and GSM528333), WGCNA was performed on the 3,000 most variable genes of 156 samples. Soft threshold power was set to 6, in which R2 was 0.916, ensured a scale-free network (Figure 2). Following this, 11 co-expression modules were identified, ranging in size from 43 to 669 genes (with each module assigned a color) (Figure 3).

FIGURE 2

FIGURE 3

Color coding of co-expression network modules for mRNAs.

Determination of soft-thresholding power in the weighted gene co-expression network analysis (WGCNA). (A) Analysis of scale-free fit index for various soft-thresholding powers (β). (B) Analysis of mean connectivity for various soft-thresholding powers. (C) Linear model fitting of R2 index showed good quality of fit. (D) Frequency distribution of connectivity. Color coding of co-expression network modules for mRNAs. By comparing the training dataset GSE21122 with the TCGA test dataset, we were able to establish whether the co-expression modules produced in the training dataset could be reproduced in the test dataset through summary preservation statistics. Three modules (black, brown, and magenta) demonstrated poor preservation with each Zsummary statistic < 10. The remaining modules, including the blue module were stable enough, suggesting they were preserved between the training data set and the test data set (Figure 4).

FIGURE 4

medianRank and Zsummary statistics of the most variant gene modules in module preservation. In the preservation medianRank graph (left), a medianRank value close to zero indicates a high degree of module preservation. In the preservation Zsummary graph (right), the dashed black lines indicate the thresholds Z = 2, 10. These horizontal lines indicate Zsummary thresholds for strong evidence of conservation (above 10) and for low to moderate evidence of conservation (above 2). It is important to identify the most significant modules related to STS. Both black and blue modules showed a significantly high correlation with sarcomas (Figure 5, 6). However, due to the lack of stability of the statistical data (Zsummary < 10), the black module was not further analyzed. Therefore, the blue module was defined as an important module of clinical significance and extracted for further analysis.

FIGURE 5

Heat map of correlation between eigengene modules and STS.

FIGURE 6

Scatter plot of eigengene modules in the blue and black modules.

Heat map of correlation between eigengene modules and STS. Scatter plot of eigengene modules in the blue and black modules. For the sake of exploration of the biological relevance of the blue module, GO functional and KEGG pathway enrichment analyses were performed on 414 genes in the blue module. The biological processes of the genes in the blue module were found to associate with the cell cycle, such as mitotic nuclear division, chromosome segregation and sister chromatid segregation. In the KEGG pathway analysis, cell cycle associated signaling pathways such as DNA replication, cell cycle, p53 signaling pathway, oocyte meiosis, mismatch repair and metabolism associated pathways such as pyrimidine metabolism and purine metabolism were enriched (Figure 7).

FIGURE 7

Bioinformatic analysis of genes in the blue module. GO analysis: (A) Biological process. (B) Cellular component. (C) Molecular function. KEGG analysis:(D) Pathway analysis.

Identification of Sarcoma Hub Genes in the Blue Module

Highly connected hub genes within a module perform important roles in tumor biological processes. Therefore, the 20 genes with greatest module relevance in the blue module were selected as candidate hub genes for STS (Supplementary Data Sheet S1). In addition, a PPI network in the blue module was constructed in accordance with the STRING database (Figure 8). Twelve of the 20 candidate genes in the co-expression network were also identified as hub nodes of the PPI network. Finally, these 12 genes were considered “primary” hub genes associated with STS and therefore selected for additional analyses.

FIGURE 8

Protein-protein interaction network of the top 30 genes in the blue module (Node color: deeper colors indicates higher scores in the MCC analysis).

Protein-protein interaction network of the top 30 genes in the blue module (Node color: deeper colors indicates higher scores in the MCC analysis). While testing the TCGA dataset, four out of 12 hub genes demonstrated significant connectivity with overall and disease-free survival (Figure 9). When testing the GSE21050 dataset, these four hub genes showed significant correlation with metastasis free survival (Figure 10). Furthermore, they were significantly highly expressed in STS tissue compared to normal fat tissue (Figure 11).

FIGURE 9

FIGURE 10

Survival analysis of association between RRM2, BUB1B, CENPF, and KIF20A expression levels and metastasis-free survival rates in STS based on GSE21050 microarray data.

FIGURE 11

RRM2, BUB1B, CENPF, and KIF20A were strongly upregulated in STS tissues compared to normal fat tissue, based on GSE21122 microarray data. ∗∗p < 0.01.

Survival analysis of association between RRM2, BUB1B, CENPF, and KIF20A expression levels and survival rates in STS based on TCGA microarray data. (A) Overall survival analysis. (B) Disease free survival. Survival analysis of association between RRM2, BUB1B, CENPF, and KIF20A expression levels and metastasis-free survival rates in STS based on GSE21050 microarray data. RRM2, BUB1B, CENPF, and KIF20A were strongly upregulated in STS tissues compared to normal fat tissue, based on GSE21122 microarray data. ∗∗p < 0.01.

Gene Set Enrichment Analysis

In order to find out the potential function of both blue module and hub genes, GSEA was performed to identify KEGG pathways enriched in samples with higher level of ME of blue module. In GSEA analysis, five signaling pathways were significantly enriched, including ubiquitin mediated proteolysis (FDR = 0.01), pyrimidine metabolism (FDR = 0.03), oocyte meiosis (FDR = 0.02), cell cycle (FDR = 0.04) and DNA replication (FDR = 0.04) (Figure 12). Moreover, the last four pathways were consistent with the results of KEGG pathway analysis (Figure 7D).

FIGURE 12

Gene set enrichment analysis (GSEA). Cell cycle and metabolism associated pathways were enriched.

Discussion

Soft tissue sarcomas remain among the most challenging diseases for medical oncologists to treat. STSs are mesenchymal neoplasms that can arise from any site within the body, including extremities, the trunk, retroperitoneum, head, and neck. These are biologically heterogeneous diseases of which greater than 50 subtypes exist, varying by molecular, histological and clinical characteristics. In this study, WGCNA was utilized to construct a co-expression network for identification of gene co-expression modules associated with STS. The blue module was positively identified and 20 hub genes selected from this module. In addition, as a result of the PPI network, 12 genes were identified as hub nodes of the co-expression module and PPI network, indicating that these 12 hub genes were closely related to STS and had important biological significance. Subsequent survival analysis established that four of the 12 hub genes (RRM2, BUB1B, CENPF, and KIF20A) were significantly associated with survival. We, therefore, focused on these four genes. The ribonucleotide reductase regulatory subunit M2 (RRM2) is one of two subunits that constitute ribonucleotide reductase, the enzyme responsible for catalyzing the conversion of ribonucleotides into deoxyribonucleotides, and thus performing an important role in DNA synthesis. RRM2 is important in controlling cellular function in a number of human malignant tumors, including DNA repair, cell proliferation and senescence. Importantly, RRM2 functions as a driver in a variety of tumors, with in vivo and in vitro experiments confirming that knocking down expression using siRNA significantly inhibits tumor cell proliferation (Fang et al., 2016). The BUB1 mitotic checkpoint serine/threonine kinase B (BUB1B) is a member of the spindle assembly checkpoint protein family, crucial for ensuring correct chromosome separation during cell division (Fu et al., 2016). BUB1B perfoms a role in the inhibition of APC expression, established as a tumor suppressor gene in most colorectal cancers. Accordingly, many reports have shown that upregulation of BUB1B is related to the recurrence and progression of bladder cancer (Yamamoto et al., 2007), gastric cancer (Ando et al., 2010), esophageal squamous cell carcinoma (Tanaka et al., 2008), breast cancer (Yuan et al., 2006), hepatocellular carcinoma (Liu et al., 2009) and others. Centromere protein F (CENPF) is another important protein involved in chromosome segregation during mitosis. Upregulation of CENPF protein expression, especially through a gene amplification effect, suggests that high levels of CENPF protein may affect the occurrence of tumors, especially in the early stages of tumor development (Varis et al., 2006). Clinical research has demonstrated that high expression levels of CENPF results in poor prognosis in nasopharyngeal carcinoma (Cao et al., 2010), colorectal gastrointestinal stromal tumors (Chen et al., 2011), esophageal squamous cell carcinoma (Mi et al., 2013) and prostate cancer (Zhuo et al., 2015). It has also been shown to play an important role in driving hepatocellular carcinoma (Dai et al., 2013). Kinesin family member 20A (KIF20A, also known as RAB6KIFL) belongs to the kinesin superfamily-6, located in the Golgi apparatus and contributes to intracellular organelle transport and cell division (Echard et al., 1998). Recently, it has been reported that KIF20A is associated with mitosis, cell adhesion, migration and proliferation. Furthermore, recent studies have demonstrated that KIF20A is involved in tumor progression and angiogenesis. High expression of KIF20A results poor prognosis in glioma patients (Duan et al., 2016; Saito et al., 2017), nasopharyngeal cancer (Liu et al., 2017), hepatocellular carcinoma (Shi et al., 2016), melanoma (Yamashita et al., 2012) and early-stage cervical squamous cell carcinoma (Zhang et al., 2016). Regarding GSEA, it was found that cell cycle and metabolism associated pathways were significant enriched in samples with higher level of ME of blue module. This is consistent with the initial GO and KEGG analysis results of the blue module and are related to the physiological function of these four hub genes. In summary, through WGCNA and other related analysis methods, we identified four genes (RRM2, BUB1B, CENPF, and KIF20A) related to the progression and prognosis of STS. These genes may play a role by regulating the cell cycle and metabolism associated signaling pathways.

Author Contributions

ZZ and DS designed the study. ZZ and ZJ performed the data collection. ZJ and LW performed the data analysis. ZZ and MZ drafted the manuscript. All authors read and approved the final version of the manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

16 in total

1. Prognostic significance of KIF2A and KIF20A expression in human cancer: A systematic review and meta-analysis.

Authors: Xing Li; Kunpeng Shu; Zhifeng Wang; Degang Ding
Journal: Medicine (Baltimore) Date: 2019-11 Impact factor: 1.817

2. Improving existing analysis pipeline to identify and analyze cancer driver genes using multi-omics data.

Authors: Quang-Huy Nguyen; Duc-Hau Le
Journal: Sci Rep Date: 2020-11-25 Impact factor: 4.379

3. The role of gene to gene interaction in the breast's genomic signature of pregnancy.

Authors: Pedro J Gutiérrez-Díez; Javier Gomez-Pilar; Roberto Hornero; Julia Martínez-Rodríguez; Miguel A López-Marcos; Jose Russo
Journal: Sci Rep Date: 2021-01-29 Impact factor: 4.379

4. The Oncolytic Activity of Myxoma Virus against Soft Tissue Sarcoma Is Mediated by the Overexpression of Ribonucleotide Reductase.

Authors: Yanghee Woo; Susanne G Warner; Rula Geha; Marianne M Stanford; Penelope Decarolis; Masmudur M Rahman; Samuel Singer; Grant McFadden; Yuman Fong
Journal: Clin Med Insights Oncol Date: 2021-02-11

5. piR-39980 mediates doxorubicin resistance in fibrosarcoma by regulating drug accumulation and DNA repair.

Authors: Basudeb Das; Neha Jain; Bibekanand Mallick
Journal: Commun Biol Date: 2021-11-19

6. Identification of key gene modules and hub genes of human mantle cell lymphoma by coexpression network analysis.

Authors: Dongmei Guo; Hongchun Wang; Li Sun; Shuang Liu; Shujing Du; Wenjing Qiao; Weiyan Wang; Gang Hou; Kaigang Zhang; Chunpu Li; Qingliang Teng
Journal: PeerJ Date: 2020-03-20 Impact factor: 2.984

7. Construction and validation of an eight-gene signature with great prognostic value in bladder cancer.

Authors: Xin Yan; Xun Fu; Zi-Xin Guo; Xiao-Ping Liu; Tong-Zu Liu; Sheng Li
Journal: J Cancer Date: 2020-01-17 Impact factor: 4.207

8. IRF1 association with tumor immune microenvironment and use as a diagnostic biomarker for colorectal cancer recurrence.

Authors: Yanfang Wu; Shuju Zhang; Jun Yan
Journal: Oncol Lett Date: 2020-01-10 Impact factor: 2.967

9. Identification of Long Noncoding RNAs as Predictors of Survival in Triple-Negative Breast Cancer Based on Network Analysis.

Authors: Xiao-Xiao Li; Li-Juan Wang; Jie Hou; Hong-Yang Liu; Rui Wang; Chao Wang; Wen-Hai Xie
Journal: Biomed Res Int Date: 2020-03-03 Impact factor: 3.411

10. Weighted gene coexpression network analysis identifies the key role associated with acute coronary syndrome.

Authors: Yong Wang; Liu Miao; Lin Tao; Jian-Hong Chen; Chuan-Meng Zhu; Ye Li; Bin Qi; Fei Liao; Rong-Shan Li
Journal: Aging (Albany NY) Date: 2020-10-14 Impact factor: 5.682