Bingli Zheng1, Jie Peng2, Ablikim Mollayup1, Ahmat Bakri1, Lei Guo1, Jianjiang Zheng1, Hui Xu1. 1. Department of Pancreatic Surgery, Xinjiang Uygur Autonomous Region People's Hospital, Ürümqi, Xinjiang 830002, P.R. China. 2. Emergency Department, Traditional Chinese Medicine Hospital of Xinjiang Medical University, Ürümqi, Xinjiang 830000, P.R. China.
Abstract
Pancreatic cancer (PC) is associated with high mortality rates and poor prognoses. Pancreatic adenocarcinoma is the most common type of PC, and almost all cases of pancreatic adenocarcinoma are pancreatic ductal adenocarcinoma (PDAC). The aim of the current study was to reveal the genes involved in the prognosis of PDAC. Five datasets, including GSE71729 (145 PDAC samples and 46 normal samples), GSE15471 (39 PDAC samples and 39 normal samples), GSE1542 (24 PDAC samples and 25 normal samples), GSE28735 (45 PDAC samples and 45 normal samples) and GSE62452 (69 PDAC samples and 69 normal samples) were downloaded from the Gene Expression Omnibus database. Using the MetaDE.ES method in the MetaDE package, differentially expressed genes (DEGs) were identified from the five datasets. Furthermore, prognosis‑associated genes were screened using the Cox regression analysis in the survival package, and co‑expression network and module analyses were performed separately using Cytoscape software and GraphWeb tool, respectively. After a prognostic prediction system was constructed and validated, enrichment analysis of the signature genes was performed using the clusterProfiler package. A total of 480 DEGs were identified from the five datasets and 259 prognosis‑associated genes were screened from GSE28735 and GSE62452. In addition, the prognostic prediction system composed of 67 signature genes [including basic transcription factor 3 (BTF3), serine/threonine kinase 11 (STK11), thrombospondin 1 (THBS1), ribosomal protein L38 (RPL38) and secretin receptor (SCTR)] was constructed and validated. The signature genes involved in the co‑expression network were enriched in five pathways. In particular, STK11 was involved in three signaling pathways, and THBS1 was enriched in the phosphoinositide 3‑kinase‑Akt signaling pathway. Thus, BTF3, STK11, THBS1, RPL38 and SCTR may influence the prognosis of PDAC.
Pancreatic cancer (PC) is associated with high mortality rates and poor prognoses. Pancreatic adenocarcinoma is the most common type of PC, and almost all cases of pancreatic adenocarcinoma are pancreatic ductal adenocarcinoma (PDAC). The aim of the current study was to reveal the genes involved in the prognosis of PDAC. Five datasets, including GSE71729 (145 PDAC samples and 46 normal samples), GSE15471 (39 PDAC samples and 39 normal samples), GSE1542 (24 PDAC samples and 25 normal samples), GSE28735 (45 PDAC samples and 45 normal samples) and GSE62452 (69 PDAC samples and 69 normal samples) were downloaded from the Gene Expression Omnibus database. Using the MetaDE.ES method in the MetaDE package, differentially expressed genes (DEGs) were identified from the five datasets. Furthermore, prognosis‑associated genes were screened using the Cox regression analysis in the survival package, and co‑expression network and module analyses were performed separately using Cytoscape software and GraphWeb tool, respectively. After a prognostic prediction system was constructed and validated, enrichment analysis of the signature genes was performed using the clusterProfiler package. A total of 480 DEGs were identified from the five datasets and 259 prognosis‑associated genes were screened from GSE28735 and GSE62452. In addition, the prognostic prediction system composed of 67 signature genes [including basic transcription factor 3 (BTF3), serine/threonine kinase 11 (STK11), thrombospondin 1 (THBS1), ribosomal protein L38 (RPL38) and secretin receptor (SCTR)] was constructed and validated. The signature genes involved in the co‑expression network were enriched in five pathways. In particular, STK11 was involved in three signaling pathways, and THBS1 was enriched in the phosphoinositide 3‑kinase‑Akt signaling pathway. Thus, BTF3, STK11, THBS1, RPL38 and SCTR may influence the prognosis of PDAC.
Pancreatic cancer (PC), which arises when pancreatic cells proliferate abnormally, is characterized by yellow skin, weight loss, back or abdominal pain, dark urine, light-colored stools, and appetite loss (1). The early symptoms of PC are not obvious, thus newly diagnosed PC cases have usually reached an advanced stage (2). PC is primarily induced by tobacco smoking, diabetes, obesity and genetic conditions (3,4). Globally, PC is the seventh leading cause of cancer-associated mortality, resulting in 330,000 fatalities in 2012 (5). The prognosis of PC is usually very poor, with 5% of people surviving for five years and 25% surviving just one year after diagnosis (5,6). Pancreatic adenocarcinoma is the most common type of PC and accounts for 85% of all PC cases (3). Almost all cases of pancreatic adenocarcinoma originate from the ducts of the pancreas, and are termed pancreatic ductal adenocarcinoma (PDAC) (7). Therefore, revealing the underlying mechanisms of PDAC is significant for developing novel treatments.Yamazaki et al (8) demonstrated that SMAD family member 3 contributes to the malignant potential of PDAC by inducing epithelial-mesenchymal transition (EMT) in tumor cells and thus presents as a promising biomarker of poor prognosis (8). Masugi et al (9) identified that integrin β4 functions in regulating EMT and cancer invasion, and its overexpression has clinicopathological and prognostic significance in PDAC (9). Sex determining region Y-box 9 (SOX9) and phosphorylated-v-Aktmurinethymoma viral oncogene homolog 1 (p-AKT) are reported to be associated with proliferation and distant metastasis, indicating that SOX9 and p-AKT may potentially predict the prognosis of PDAC (10). The downregulation of anterior gradient 2 is induced by EMT and serves as a novel prognostic marker in patients with PDAC (11). In PDACpatients who undergo pancreaticoduodenectomy, high expression levels of transforming growth factor β-1 may inhibit the poorer survival that is associated with high proliferation (12). However, the genes involved in the prognosis of PDAC have not been fully reported.In the current study, public databases were searched for microarray datasets associated with PDAC in homo sapiens. Subsequently, comprehensive bioinformatics analyses [including identification of differentially expressed genes (DEGs) and prognosis-associated genes, co-expression network and module analyses, construction and validation of prognostic prediction system, and enrichment analysis] were performed successively to investigate the key prognosis-associated genes in PDAC. The current study may contribute to developing targeted therapeutic strategies for improving the prognosis of PDAC.
Data and methods
Expression profile data
Using ‘Pancreatic ductal adenocarcinoma’ and ‘homo sapiens’ as keywords, the gene expression profiles in the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) database were searched. The datasets that met the following criteria were included: i) The dataset was associated with PDAC; ii) the dataset included PDAC tissue samples and normal control samples. Finally, a total of five datasets were included in the current study, including GSE71729 (based on the GPL20769 platform, including 145 PDAC samples and 46 normal samples), GSE15471 (based on the GPL570 platform, including 39 PDAC samples and 39 normal samples), GSE1542 (based on the GPL96 platform, including 24 PDAC samples and 25 normal samples), GSE28735 (based on the GPL9644 platform, including 45 PDAC samples and 45 normal samples) and GSE62452 (based on the GPL9644 platform, including 69 PDAC samples and 69 normal samples).
Data preprocessing
Using the Affy package (http://www.bioconductor.org/packages/release/bioc/html/affy.html) (13) in R language, background correction and normalization of raw data of GSE15471 and GSE1542 were conducted. For the datasets of GSE71729, GSE28735 and GSE62452, probes were corresponded to gene symbols based upon platform annotation information. After the unloaded probes were removed, the average value of the probes mapped to the same gene was obtained as the initial gene expression value. Finally, the data were normalized by the linear models for microarray data using the limma package (http://www.bioconductor.org/packages/release/bioc/html/limma.html) (14) in R language.
Meta-analysis to screen characteristic factors
Using the quality control standards [including external quality control, internal quality control, consistency quality control (CQCg and CQCp) and accuracy quality control (AQCg and AQCp)] in the MetaQC package (15), quality control was performed for the datasets. To identify reliable datasets, the datasets were further assessed and screened using the two-dimensional diagram of principal component analysis (PCA) and standardized mean rank scores. Based on the MetaDE.ES method in the MetaDE package (15), homogeneous unbiased genes were identified using a heterogeneity test [thresholds: Tau2=0 and Qpval (statistical parameter representing heterogeneity of the dataset) >0.05] and the DEGs were screened [threshold: False discovery rate (FDR) <0.05].
Identification of prognosis-associated genes
Among the included datasets, GSE28735 and GSE62452 were based upon the GPL9644 platform and contained sample survival information. Thus, the data of the two datasets were merged and prognosis-associated genes were identified using the Cox regression analysis in survival package, with P<0.05 set as the significant threshold (https://github.com/therneau/survival). After significant P-values were obtained using the log-rank test, Kaplan-Meier (KM) survival analysis was performed on the top six genes with higher-logRank (P-values) (16).
Co-expression network and module analyses
The expression values of the prognosis-associated genes were extracted from the datasets and the COR function (17) in R language was used to calculate their correlation coefficients. The co-expression pairs with correlation coefficient |r|≥0.6 and P<0.05 were selected for constructing a co-expression network using Cytoscape software (version 3.5.0; http://www.cytoscape.org/) (18). Furthermore, module analysis was performed for the co-expression network using the GraphWeb tool (http://biit.cs.ut.ee/graphweb/) (19).
Construction of a prediction and discrimination system for prognosis
The PDAC samples in GSE28735 and GSE62452 had survival information, and thus were taken as the training dataset for the prediction and discrimination system of prognosis. Firstly, the samples were divided into alive and deceased groups according to their survival states. Secondly, the samples were further divided into groups with bad prognosis (deceased and alive survival time <15 months) and good prognosis (alive and alive survival time ≥15 months). After the genes in the co-expression network were sorted in descending order according to their-logRank (P-values), Baye's discriminant analysis was performed using the discriminant.bayes function in the e1071 package (20). Genes were added one by one, and the genes affecting discrimination accuracy were removed until the highest discrimination accuracy was obtained. Under the highest discrimination accuracy, the discrimination coefficient of each sample, gene combination, and discrimination system were defined as the prognostic score, signature gene and prognostic prediction system, respectively.
Validation of the prognostic prediction system
To detect the effect of the prognostic prediction system, KM survival analysis (16) was performed for GSE28735 and GSE62452 to verify the correlation between the sample types recognized by the prognostic prediction system, and the actual survival time and states. In addition, the microarray data under E-MEXP-2780 (downloaded in January 5, 2017; including 30 PDAC samples containing survival information) were downloaded from the European Bioinformatics Institute (http://www.ebi.ac.uk/) database and used as an independent validation dataset for the prognostic prediction system. In addition, the PDAC dataset (downloaded in January 5, 2017; based on Illumina HiSeq 2000 RNA Sequencing platform; including 183 PDAC samples, among which 163 samples had survival information) in The Cancer Genome Atlas (TCGA; https://cancergenome.nih.gov/) database was also downloaded. Subsequently, the expression values of the signature genes were extracted from E-MEXP-2780 and the PDAC dataset was downloaded from the TCGA database. After prognostic scores were obtained based on the prognostic prediction system, the samples in E-MEXP-2780 and the PDAC dataset downloaded from the TCGA database were divided into groups with bad prognosis and good prognosis. Finally, KM survival analysis (16) was performed to analyze the correlation between the sample types recognized by the prognostic prediction system, and the actual survival time and states.
Construction of the co-expression network for the signature genes and identification of the key genes
The co-expression pairs of the signature genes were identified from the above co-expression network, and the co-expression networks for the signature genes were visualized. Gene Ontology (GO) (21) functional analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway (22) enrichment analysis were performed on genes in the co-expression network using the clusterProfiler package (http://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) (23) in R. To further identify the key genes associated with PDAC, the enriched terms were integrated into the co-expression network and the genes involved in multiple terms were selected.
Results
Identification of characteristic factors and prognosis-associated genes
After the raw data of the five datasets were normalized, quality control was further performed for the datasets (Fig. 1; Table I). The two-dimensional diagram of PCA for the five datasets is presented in Fig. 2. As a result, all five datasets were in accordance with the quality control standards and were included for the subsequent analysis. Based on the MetaDE.ES method, a total of 480 DEGs were identified from the five datasets.
Figure 1.
Data analysis flow diagram.
Table I.
Results of quality control measures and SMRs.
Dataset
IQC
EQC
CQCg
CQCp
AQCg
AQCp
SMR
GSE71729
3.85
4.83
264.65
133.86
32.71
82.01
2.42
GSE15471
4.96
3.09
275.15
70.62
39.49
39.38
3.18
GSE1542
4.09
4.34
242.36
103.51
101.54
58.61
3.92
GSE28735
5.11
3.21
310.21
90.31
18.24
29.18
3.17
GSE62452
5.42
3.23
307.54
75.54
19.03
29.94
3.08
IQC, internal quality control; EQC, external quality control; CQCg, consistency quality control for genes; CQCp, consistency quality control for pathways; AQCg, accuracy quality control for genes; AQCp, accuracy quality control for pathways; SMR, standardized mean rank.
Figure 2.
Two-dimensional diagram of principal component analysis for the five datasets. EQC, external quality control; AQCp, accuracy quality control for pathways; CQCp, consistency quality control for pathways; IQC, internal quality control; AQCg, accuracy quality control for genes; CQCg, consistency quality control for genes.
GSE28735 and GSE62452 were merged and a total of 259 prognosis-associated genes were screened. Subsequently, KM survival analysis was performed with the top six prognosis-associated genes, including transketolase (TKT), basic transcription factor 3, (BTF3), myomesin 1 (MYOM1), neurexophilin 4 (NXPH4); serine/threonine kinase 24 (STK24), transducin β-like 3 (TBL3) with the highest-logRank (P-values). The KM survival curves are presented in Fig. 3.
Figure 3.
Kaplan-Meier survival curves for the top six prognosis-associated genes with higher -logRank (P-values): (A) TKT, (B) BTF3, (C) MYOM1, (D) NXPH4, (E) STK24 and (F) TBL3. The red and black lines represent samples with high and low expression levels, respectively. HR, hazard ratio; TKT, transketolase; BTF3, basic transcription factor 3; MYOM1, myomesin 1; NXPH4, neurexophilin 4; STK24, Serine/threonine-protein kinase 24; TBL3, transducin β-like protein 3.
Under |r|≥0.6 and P<0.05, the co-expression pairs among the prognosis-associated genes were identified and the co-expression network (which had 213 nodes and 1,984 edges) was constructed (Fig. 4). Subsequently, a total of six modules were identified from the co-expression network. Furthermore, GO functional enrichment analysis was performed for the genes involved in each of the modules. Additionally, the most significant GO terms enriched for purple, yellow, cyan, blue, green, and red modules separately were cell cycle arrest (P=1.45e-03), signaling (P=2.34e-02), regulation of signal transduction (P=2.78e-03), signal transduction (P=9.84e-04), cellular process (P=3.57e-04), and developmental growth (P=1.36e-03).
Figure 4.
Co-expression network constructed for the prognosis-associated genes. Red and blue lines indicate that correlation coefficients were positive and negative, respectively. The sections within the black squares are modules identified from the network, and the most significant Gene Ontology terms enriched for it are noted in the diagram.
Construction and validation of a prognostic prediction system
A total of 107 PDAC samples in GSE28735 and GSE62452 had survival information, and were divided into groups with good (52 samples) and bad (55 samples) prognoses. According to the process in Fig. 5A, the prognostic prediction system composed of 67 signature genes [including BTF3, serine/threonine kinase 11 (STK11), thrombospondin 1 (THBS1), ribosomal protein L38 (RPL38) and secretin receptor, (SCTR)] was finally constructed. The area under the receiver operating characteristic (ROC) curve (AUC) demonstrating the discriminant accuracy of the prognostic prediction system is presented in Fig. 5B. In addition, a discriminant scoring system was constructed for the prognostic prediction system, which was as follows:
Figure 5.
(A) Process for constructing a prognostic prediction system based upon the GSE28735 and GSE62452 datasets. (B) AUC showing the discriminant accuracy of the prognostic prediction system. (C) The Kaplan-Meier survival curve for detecting the effect of the prognostic prediction system based upon GSE28735 and GSE62452. AUC, area under the receiver operating characteristic curve.
KM survival analysis was performed for GSE28735 and GSE62452 to detect the effect of the prognostic prediction system, identifying that the group with good prognosis had a significantly higher survival rate when compared with the group with bad prognosis (P=4.21e-12; Fig. 5C). In addition, the PDAC dataset downloaded from the TCGA database served as an independent validation dataset for the prognostic prediction system. The AUC and KM survival curve in Fig. 6A demonstrates that the survival rate of the good prognosis group was significantly higher than that of the bad prognosis group (P=3.17e-08). Furthermore, the microarray data of E-MEXP-2780 was used to validate the prognostic prediction system, identifying that the group with good prognosis exhibited a significantly higher survival rate when compared with the bad prognosis group (P=1.58e-06; Fig. 6B). Therefore, the prognostic prediction system could accurately classify the samples from prognostic level.
Figure 6.
AUC and Kaplan-Meier survival curves for the pancreatic ductal adenocarcinoma dataset downloaded from (A) The Cancer Genome Atlas database and (B) the microarray data of E-MEXP-2780. AUC, area under the receiver operating characteristic curve.
Construction of co-expression network for the signature genes and identification of the key genes
The co-expression network constructed for the signature genes included 56 signature genes and 237 edges (Fig. 7). A total of 14 GO_biological process (BP) terms and five KEGG signaling pathways were enriched for the genes involved in the co-expression network (Table II). Meanwhile, the five signaling pathways were merged into the co-expression network for the signature genes. As presented in Fig. 7, phosphoenolpyruvate carboxykinase 1 (PCK1) was enriched in all five signaling pathways, and STK11 was involved in three signaling pathways. In addition, THBS1 was enriched in the phosphoinositide 3-kinase (PI3K)-Akt signaling pathway (P=0.017456). In addition, KM survival analysis was performed for STK11 (Fig. 8A) and PCK1 (Fig. 8B) based on the GSE28735 and GSE62452 datasets.
Figure 7.
Co-expression network constructed for the signature genes. The signaling pathways enriched for the signature genes were merged into the network and the genes within the same circle were enriched in the same signaling pathway. AMPK, AMP-activated protein kinase; PI3K, phosphoinositide 3-kinase; PPAR, peroxisome proliferator-activated receptor.
Table II.
GO_ BP terms and signaling pathways enriched for the signature genes involved in the co-expression network.
Category
Description
Gene no.
P-value
Gene symbol
GO_BP
GO:0010565~regulation of cellular ketone metabolic process
3
0.02095
APOA4, MLYCD, STK11
GO:0030300~regulation of intestinal cholesterol absorption
2
0.023282
APOA4, APOA1
GO:0046486~glycerolipid metabolic process
4
0.02552
APOA4, GPD1, APOA1, PCK1
GO:0010873~positive regulation of cholesterol esterification
2
0.02711
APOA4, APOA1
GO:0046782~regulation of viral transcription
2
0.02711
TARBP2, MDFIC
GO:0010872~regulation of cholesterol esterification
2
0.030924
APOA4, APOA1
GO:0048524~positive regulation of viral reproduction
2
0.038507
TARBP2, MDFIC
GO:0033700~phospholipid efflux
2
0.038507
APOA4, APOA1
GO:0051186~cofactor metabolic process
4
0.040832
AMBP, GPD1, BAAT, MLYCD
GO:0002683~negative regulation of immune system process
Kaplan-Meier survival curves for (A) STK11 and (B) PCK1. HR, hazard ratio; STK1, serine/threonine kinase 11; PCK1, phosphoenolpyruvate carboxykinase 1.
Discussion
In the current study, a total of 480 DEGs were identified from five datasets. In addition, 259 prognosis-associated genes were screened from GSE28735 and GSE62452, and BTF3 was among the top six prognosis-associated genes. Furthermore, a total of six modules were identified from the co-expression network for the prognosis-associated genes. A prognostic prediction system composed of 67 signature genes, which included BTF3, STK11, THBS1, RPL38 and SCTR, was finally constructed and validated. Finally, the co-expression network for the signature genes was constructed and the signature genes involved in the co-expression network were enriched in five signaling pathways. In particular, STK11 was involved in three of the signaling pathways.STK11/LKB1 causes somatic mutations in intraductal papillary mucinous neoplasms (IPMNs), sporadic pancreatic adenocarcinomas and biliary adenocarcinomas, and its expression is abrogated in pancreatic and biliary neoplasms (24,25). Sato et al (26) identified that the STK11/LKB1 gene is associated with the development and progression of certain IPMNs (26). The LKB1-p21 axis cooperates with the Kirsten ratsarcoma viral oncogene homolog (Kras) mutation to inhibit PDAC in vivo, and downregulated LKB1 may function in promoting PC by replacing the p53 mutation (27,28). As a Peutz-Jeghers syndrome gene, LKB1 induces apoptosis of PC cells in a p73-dependent manner (29). These studies indicate that STK11 is implicated in the prognosis of PDAC.Stromal expression levels of THBS1 are a prognostic marker and invasive indicator in IPMN (30). By upregulating THBS1 and caveolin-1 and downregulating cyclin D1, metronomic C2 and AL6 analogs perform antiangiogenic and antitumor roles in PC (31). THBS1 is implicated in cell growth and metastasis of PC cells, and stromal THBS1 immunoreactivity may be used for predicting the prognosis of PC patients (32). By promoting the expression of matrix metalloproteinase-9, THBS1 is important in mediating matrix remodeling in the invasion of pancreatic adenocarcinoma (33,34). The signature gene, THBS1 was enriched in the PI3K-Akt signaling pathway, indicating that THBS1 may be involved in the prognosis of PDAC via the PI3K-Akt signaling pathway.Overexpressed BTF3 functions as a transcriptional regulator by mediating the transcription of tumor-associated genes in PDAC (35). RPL38, FOS-like antigen-1 and uridine phosphorylase are highly expressed in PC cell lines; therefore, they have the potential to serve as tumor markers or in tumor targeting (36). SCTR are key in regulating healthy pancreatic ductal epithelial cells, and its silence may contribute to tumor growth and progression of PC (37). SCTR is overexpressed in non-neoplastic pancreas ducts and its isoforms may be correlated with decreased secretin binding in pancreatic ductal tumors, indicating that SCTRs may represent promising clinical targets in pancreatic tumors (38). Therefore, BTF3, RPL38, and SCTR may be important in the prognosis of PDAC.There were certain limitations of the present study. The findings obtained from bioinformatics analysis require further validation via experimental studies. However, the validation experiment could not be performed in the current study due to being limited by experimental conditions and sample sources. The present results may provide valuable data for future investigations.In conclusion, a total of 480 DEGs were identified from five datasets, and 259 prognosis-associated genes were screened from GSE28735 and GSE62452. Furthermore, the prognostic prediction system composed of 67 signature genes was constructed and validated. Notably, signature genes, including BTF3, STK11, THBS1, RPL38 and SCTR may function in the prognosis of PDAC.
Authors: Christopher L Wolfgang; Joseph M Herman; Daniel A Laheru; Alison P Klein; Michael A Erdek; Elliot K Fishman; Ralph H Hruban Journal: CA Cancer J Clin Date: 2013-07-15 Impact factor: 508.702
Authors: Michele K McElroy; Sharmeela Kaushal; Hop S Tran Cao; A R Moossa; Mark A Talamini; Robert M Hoffman; Michael Bouvet Journal: Mol Cancer Ther Date: 2009-07-07 Impact factor: 6.261
Authors: Jennifer P Morton; Nigel B Jamieson; Saadia A Karim; Dimitris Athineos; Rachel A Ridgway; Colin Nixon; Colin J McKay; Ross Carter; Valerie G Brunton; Margaret C Frame; Alan Ashworth; Karin A Oien; T R Jeffry Evans; Owen J Sansom Journal: Gastroenterology Date: 2010-05-06 Impact factor: 22.682