Identification of molecular biomarkers for pancreatic cancer with mRMR shortest path method.

Shuhua Shen, Tuantuan Gui, Chengcheng Ma.

Abstract

The high mortality rate of pancreatic cancer makes it one of the most studied of all cancer types. Many studies have been conducted to understand the mechanisms underlying its emergence and pathogenesis. Here, we analyzed a set of pancreatic cancer transcriptome data using the minimum-redundancy-maximum-relevance (mRMR) method. As we gradually added features to reach the most accurate jackknife classification, a set of 9 genes was identified: NHS, SCML2, LAMC2, S100P, COL17A1, AMIGO2, PTPRR, KPNA7 and KCNN4. Through analysis of STRING 2.0 protein-protein interactions (PPIs), 40 proteins were identified on the shortest paths between genes in the gene set; 30 of them passed the permutation test, indicating that they were hubs in the background network. The genes in the protein-protein interaction network were enriched in 37 functional modules, such as negative regulation of transcription from RNA polymerase II promoter, negative regulation of ERK1 and ERK2 cascade, and BMP signaling pathway. Our study points to new mechanisms of pancreatic cancer and suggests potential therapeutic targets for further study.

Keywords: biomarker; minimum-redundancy-maximum-relevance (mRMR); pancreatic cancer

Substances: Biomarkers, Tumor

Year: 2017 PMID: 28611293 PMCID: PMC5522256 DOI: 10.18632/oncotarget.18186

Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553

INTRODUCTION

Pancreatic cancer is one of the most lethal of all cancer types, causing about 79,400 deaths in China [1] and 330,400 deaths worldwide [2]. The five-year survival rate is only 2–7% [3, 4]. This poor outcome may be largely due to late diagnosis, and the mechanism underlying disease progression is still unclear. The expression profiles of pancreatic cancer have been widely studied, revealing several molecular factors that affect various aspects of the disease [5]. Terris et al. found that four genes (caveolin 1, glypican 1, growth arrest-specific 6 protein and cysteine-rich angiogenic inducer 61) were associated with the pathogenesis of pancreatic cancer and were possible indicators of early-stage disease [6]. PCK1 and SFRP2 were identified as potential metastasis markers for some patients in a study comparing the expression profiles of primary and metastatic pancreatic cancer [7]. GSTT1, TOP2A, CASP3 and ABCC2 were found to predict gemcitabine sensitivity [8]. These studies usually adopted differentially expressed gene (DEG) methods, which consider the relevance between the expression level of each gene and a given phenotype separately and ignore the relationships between genes. Such methods introduce redundancy into the findings, burying the most representative genes in the bulk results. Feature selection usually means maximizing classification accuracy with a combination of selected features integrated into a classification model. To that end, features that pass a certain relevance threshold are selected, with relevance usually characterized in terms of correlation or mutual information. However, many genes work closely together as a functional module, and the interactions among them may contribute to class distinctions. A combination of individually good features is therefore not necessarily a good gene set for representing the whole picture of the underlying biological processes [9].
Minimum-redundancy-maximum-relevance (mRMR) has been widely used in several biological fields, such as predicting lysine ubiquitination [10], protein-protein interactions [11] and HIV progression-related genes [12]. This method considers the associations between the features and the target phenotype together with the relationships among the features themselves. Compared with other methods, mRMR has shown better classification accuracy [13]. Proteins work together to form functional modules, so the investigation of disease candidate genes should consider these interactions in order to better understand how the candidates function. Among the interaction databases, STRING (Search Tool for the Retrieval of Interacting Genes) [14] is the most frequently used because of its millions of interactions and its high-quality scoring system. With this data source, we can reconstruct the overall functional impact of the genes of interest. In this study, we performed an mRMR-based transcriptome study. The objective was to find the set of genes that best classifies tumor and non-tumor samples and thereby explains some mechanisms of the pathogenesis of pancreatic cancer. Based on graph analysis [15] of the STRING PPI network, we further identified pancreatic cancer-associated genes and functional modules worthy of further experimental study.

RESULTS

Gene probes identified by mRMR-IFS

We retrieved the gene expression profiles of 45 pancreatic cancer and 45 non-tumor samples from GEO (GSE28735), each consisting of 28,869 probes. We used the mRMR-IFS method for feature selection and a K-nearest-neighbor model for phenotype classification (see Materials and Methods). With jackknife validation, we calculated the classification accuracy for 1 to 500 probes (Figure 1). A set of 10 probes achieved an accuracy of 0.88, close to the highest accuracy of 0.89, which was obtained with 80 probes. Because the 10-probe set would be more representative than the 80-probe set, we chose the 10 gene probes (Table 1). The differential expression of LAMC2, S100P, KPNA7, AMIGO2 and KCNN4 has also been reported in other studies [16-20] (Figure 2). Some of the genes, such as LAMC2, S100P and KPNA7, have been reported to be related to PDAC [18, 21, 22]. We also identified novel pancreatic cancer genes, such as SCML2, COL17A1, AMIGO2 and PTPRR, suggesting that our method might be able to predict novel PDAC-related genes.
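The trade-off described above (a small probe set whose accuracy is close to the maximum) can be written as an explicit rule. The following is a minimal sketch, not the authors' procedure: the function name `smallest_near_best` and the `tolerance` parameter are assumptions introduced here, since the text does not state a formal cut-off.

```python
def smallest_near_best(accuracies, tolerance=0.015):
    """Return the smallest feature count whose accuracy is within
    `tolerance` of the best accuracy seen.

    accuracies: dict mapping number of features -> jackknife accuracy.
    tolerance:  hypothetical slack; the paper gives no explicit value.
    """
    best = max(accuracies.values())
    eligible = [n for n, acc in accuracies.items() if acc >= best - tolerance]
    return min(eligible)


# The two accuracies reported in the text: 0.88 at 10 probes, 0.89 at 80 probes.
reported = {10: 0.88, 80: 0.89}
print(smallest_near_best(reported))
```

With the two reported accuracies and a tolerance of 0.015, the rule returns 10; with a tolerance of 0 it returns 80, the strict maximum.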

Figure 1

IFS curve to determine the number of features used in prediction

We used an IFS curve to determine the number of features finally used in the mRMR feature selection. Prediction accuracy reached its second maximum value at 10 gene probes. The x-axis indicates the number of probes used for classification, and the y-axis is the prediction accuracy.

Table 1

The 10 gene probes selected by mRMR-IFS

Probe ID	seqname	STRAND	START	STOP	Gene Symbol	mRMR score
8166266	chrX	+	17393543	17754114	NHS	0.26468
8171561	chrX	−	18257433	18372847	SCML2	0.262552
7908072	chr1	+	1.83E+08	1.83E+08	LAMC2	0.280165
8093950	chr4	+	6694796	6698897	S100P	0.300541
8017098	chr17	−	56736510	56736657		0.270637
7936144	chr10	−	1.06E+08	1.06E+08	COL17A1	0.28748
7962579	chr12	−	47469490	47473734	AMIGO2	0.270462
7964907	chr12	−	71031853	71314586	PTPRR	0.232587
8141263	chr7	−	98775543	98805089	KPNA7	0.188924
8037408	chr19	−	44270685	44285409	KCNN4	0.817682

Figure 2

Expression differences of LAMC2, S100P, KPNA7, AMIGO2 and KCNN4 between tumors and non-tumors

This figure shows the expression differences of LAMC2, S100P, KPNA7 and AMIGO2 between tumors and non-tumors, separately. Error bars indicate standard errors.

A PPI sub-network of the genes

We then built an undirected network using PPIs from STRING [14]. Protein pairs with a PPI score greater than 0.8 were used to form a high-confidence network. From the 10 gene probes identified by mRMR-IFS, we found 9 genes corresponding to 27 proteins in STRING, 8 of which were in the high-confidence network. We computed the shortest path between every pair of these proteins using Dijkstra's algorithm [15]. The shortest paths were integrated into a sub-network (Figure 3) containing 51 protein-protein interactions among 40 proteins. We conducted a permutation test to evaluate the significance of the betweenness of each protein against the background network. The 30 proteins that passed the test were selected and ranked according to their betweenness (Supplementary Table 1). MAPK1 had the largest betweenness, 14, indicating that at least 14 shortest paths went through this gene.

Figure 3

PPI network of shortest paths among 40 computational method identified proteins

Shortest paths between each pair of the 8 proteins (black), which are among the 40 proteins selected by the computational method, were identified in the STRING PPI network. Proteins in black are the 8 proteins identified by the computational method that are also present in the STRING PPI network; proteins in red are shortest-path proteins that passed the permutation test; proteins in blue are those that did not pass.

Functional enrichment analysis of the genes

Using DAVID, we performed GO functional enrichment analysis and KEGG pathway analysis on the genes of the 10 probes. The results showed that these genes were significantly enriched in cell adhesion in organelle (Supplementary Table 2). Only one KEGG pathway was significantly enriched (hsa04974: Protein digestion and absorption; p-value = 0.038, Supplementary Table 3). We also performed KEGG pathway and GO functional enrichment analysis on the 30 hub genes on the shortest paths. The GO results showed that many genes were significantly enriched in modules such as negative regulation of transcription from RNA polymerase II promoter (Supplementary Table 4). The KEGG results showed that these genes were significantly enriched in the pancreatic cancer pathway (hsa05212: Pancreatic cancer; p-value = 3.08E-05, Supplementary Table 5).

DISCUSSION

In a previous study, Zhang et al. identified 277 differentially expressed genes in this dataset [23]. With our approach, a more compact set of features with high classification accuracy was identified (Supplementary Table 4). Among the more than 20,000 probes in the transcriptome data, we selected 10 probes corresponding to 9 genes as the best predictors: NHS, SCML2, LAMC2, S100P, COL17A1, AMIGO2, PTPRR, KPNA7 and KCNN4. Some of them have previously been shown to be associated with pancreatic cancer.

LAMC2 (laminin subunit gamma-2). Laminins are extracellular matrix glycoproteins. Studies have shown that they are involved in many biological processes, including cell adhesion, differentiation and metastasis [24-26]. Overexpression of LAMC2 has been shown to be a predictive marker of pancreatic cancer [21]. Another microarray study also found it overexpressed in PDAC tumor epithelia; moreover, its expression level correlated negatively with survival [27]. Nerve invasion is a prominent feature of pancreatic cancer. In a study using a cell line, a mouse model and patients' surgical tissues, overexpression of LAMC2 was positively associated with nerve invasion distance [28].

S100P (S100 calcium binding protein P) is a member of the S100 family of proteins. S100 proteins regulate cell cycle progression and differentiation [29]. A microarray study showed that S100P is specifically expressed in the neoplastic epithelium of pancreatic cancer [22]. The expression level of S100P correlates with the rates of cell proliferation, survival, migration and invasion, which makes the S100P protein a major promoting factor in the pathogenesis of pancreatic cancer [30]. Its abnormal expression might be due to hypomethylation [31]. Overexpression of S100P is an early marker of pancreatic cancer. It down-regulates the levels of cytoskeletal proteins, disrupts the actin cytoskeleton network and changes the phosphorylation status of cofilin. S100P also up-regulates the expression of two cellular invasion factors, S100A6 and the aspartic protease cathepsin D [32].

AMIGO2 (adhesion molecule with Ig-like domain 2) is also named DEGA (differentially expressed in gastric adenocarcinomas). As the name DEGA suggests, it may induce several deleterious alterations, including aneuploidy and abnormal adhesion, in gastric cancers [33, 34]. Antibodies against AMIGO2 have been shown to be effective against pancreatic cancer in xenograft models [17].

KPNA7 (karyopherin subunit alpha 7) is a member of the importin α family. In vitro experiments demonstrated that KPNA7 was up-regulated in pancreatic cancer. Silencing KPNA7 could increase the level of p21, promote G1 arrest and increase autophagy [18]. It is an important factor promoting the malignancy of pancreatic cancer.

KCNN4 (potassium calcium-activated channel subfamily N member 4) forms a Ca2+-activated, voltage-independent K+ channel [35]. Ca2+-activated K+ channels are involved in anion and K+ transport in stimulated pancreatic cells [36]. An in vitro study showed that blocking these channels could inhibit the growth of pancreatic cancer cells, suggesting an important role for them in the proliferation of pancreatic cancer [16].

MATERIALS AND METHODS

Dataset

The microarray gene expression profiling dataset was downloaded from the NCBI Gene Expression Omnibus (accession no.: GSE28735). The dataset contains 45 tumor and 45 non-tumor tissue samples from patients with pancreatic ductal adenocarcinoma (PDAC) [23].

Feature selection

To rank the features that best distinguish pancreatic ductal adenocarcinoma tumors from adjacent normal tissues, we applied the mRMR method, which ranks features according to their relevance to the target phenotype minus their redundancy with the other features [37]. We used the R package mRMRe to implement mRMR [38]. In mRMRe, both relevance and redundancy are quantified by mutual information (MI):

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$

where x and y are the two variables to be tested, p(x) and p(y) are their marginal probability densities, p(x, y) is their joint probability density, and I(x, y) is the MI. Let X = {x1, ..., xn} denote the set of gene probes (input features), and let y denote the phenotype (input target). The set of ranked features S is initialized with the feature xi that has the highest MI with the phenotype. Next, the remaining feature xj with the best balance between maximal relevance and minimal redundancy is added to S. It is selected by maximizing the score q according to the following equation:

$$q_j = I(x_j, y) - \frac{1}{|S|} \sum_{x_k \in S} I(x_j, x_k)$$

The selection step is repeated until the desired number of ranked features N is reached, which was 500 in our study. To determine an appropriate subset of the ranked feature list, we used incremental feature selection (IFS) to determine the most suitable number of genes in the feature subset si [39]:

$$s_i = \{f_1, f_2, \ldots, f_i\}, \quad 1 \le i \le N$$

For example, when N is 500, the first feature subset is s1 = {f1}, the second feature subset is s2 = {f1, f2}, and the last feature subset is sN = {f1, f2, ..., f500}. The feature subset with the best prediction accuracy is selected.
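The greedy ranking and the nested IFS subsets described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mRMRe implementation: MI is approximated from the Pearson correlation under a Gaussian assumption, I = -0.5 ln(1 - r^2), and the names `gaussian_mi`, `mrmr_rank` and `ifs_subsets` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def gaussian_mi(a, b):
    """MI estimate assuming joint normality (an assumption of this sketch)."""
    r = pearson(a, b)
    r2 = min(r * r, 1 - 1e-12)  # cap so that log(0) never occurs
    return -0.5 * math.log(1 - r2)


def mrmr_rank(features, target, n_select):
    """Greedy mRMR: relevance to target minus mean redundancy with S.

    features: dict mapping feature name -> list of values per sample.
    """
    relevance = {name: gaussian_mi(col, target) for name, col in features.items()}
    first = max(relevance, key=relevance.get)  # highest MI with the phenotype
    selected = [first]
    remaining = set(features) - {first}
    while remaining and len(selected) < n_select:
        def score(name):
            redundancy = sum(
                gaussian_mi(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def ifs_subsets(ranked):
    """Nested IFS subsets s_1, s_2, ..., s_N of the ranked list."""
    return [ranked[:i] for i in range(1, len(ranked) + 1)]
```

In a toy dataset where one feature is an exact multiple of another, the duplicate is ranked last even though its relevance is high, which is the behavior the redundancy term is meant to produce.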

Prediction engine

We used the k-nearest-neighbor method to predict the phenotype of each individual. The distance between two individuals was defined according to Chou and coworkers [40, 41]:

$$D(i_1, i_2) = 1 - \frac{e_1 \cdot e_2}{\lVert e_1 \rVert \, \lVert e_2 \rVert}$$

where i1 and i2 represent two individuals, D is the distance between them, e1 and e2 are the vectors of selected features (expression levels of the selected genes) of the two individuals, e1 · e2 is their inner product, and ‖e‖ is the vector norm.
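A nearest-neighbor predictor of this kind can be sketched as follows, assuming the distance is one minus the cosine similarity of the two expression vectors (the form commonly attributed to Chou and coworkers). The function names and the default `k=1` are assumptions of this sketch; the text does not state the value of k that was used.

```python
import math
from collections import Counter


def cosine_distance(e1, e2):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    return 1.0 - dot / (norm1 * norm2)


def knn_predict(train_x, train_y, query, k=1):
    """Majority label among the k training samples closest to `query`."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: cosine_distance(train_x[i], query),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Because the cosine distance ignores vector length, two samples with proportional expression profiles are treated as identical; this may or may not be desirable depending on how the expression values were normalized.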

Validation method

Independent dataset tests, subsampling tests and jackknife tests are three methods often used in statistical model validation. Compared with the other two, the jackknife test is better at avoiding the arbitrariness that exists in independent dataset and subsampling tests [40, 42, 43]. In the jackknife test, both the training dataset and the testing dataset are open, and each sample is in turn moved between the training dataset and the testing dataset. The prediction accuracy was formulated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
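The jackknife (leave-one-out) procedure can be sketched as below. The accuracy it returns equals (TP + TN) / (TP + TN + FP + FN), since every held-out sample is either classified correctly or not. The `predict` callable and the `nearest_neighbor` stand-in (squared Euclidean 1-NN) are hypothetical and only keep the example self-contained; they are not the authors' classifier.

```python
def jackknife_accuracy(samples, labels, predict):
    """Hold out each sample once, train on the rest, and score the hit rate.

    predict(train_x, train_y, query) must return a predicted label.
    """
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_x, train_y, query):
    """Stand-in classifier: label of the closest training sample."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    return train_y[best]


# Toy data: two well-separated groups of two samples each.
toy_x = [[0.0], [0.1], [1.0], [1.1]]
toy_y = [0, 0, 1, 1]
print(jackknife_accuracy(toy_x, toy_y, nearest_neighbor))
```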

Graphics approach and shortest paths tracing

The initial weighted PPI network was retrieved from STRING (version 10) [14] and used to construct a graph G(V, E). The database contains known and predicted protein interactions, which provide intuitive insight into the overall structural properties of complex biological systems. Based on the PPI network, we used Dijkstra's algorithm [15] to identify the shortest path between each pair of proteins identified by mRMR-IFS. The sub-network of shortest paths was visualized with Cytoscape [44].
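The shortest-path tracing step can be illustrated with a small standard-library implementation of Dijkstra's algorithm. This is a sketch on a toy graph: the node names, the edge weights and the helper `path_betweenness` (which counts how many seed-to-seed shortest paths pass through each intermediate node) are assumptions, and the text does not specify how STRING scores were converted into edge weights.

```python
import heapq
from itertools import combinations


def dijkstra_path(graph, source, target):
    """Shortest path in a weighted graph given as {node: {neighbor: weight}}.

    Returns the list of nodes from source to target, or None if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def path_betweenness(graph, seeds):
    """Count, per intermediate node, the seed-pair shortest paths through it."""
    counts = {}
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path is None:
            continue
        for node in path[1:-1]:  # endpoints are seeds, not intermediates
            counts[node] = counts.get(node, 0) + 1
    return counts
```

With 8 seed proteins there are 28 unordered pairs, so a count of this kind is bounded by 28 when one shortest path is kept per pair.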

Permutation test

To test whether the 40 shortest-path genes were hubs in the background network, we conducted a permutation test. In each permutation, 8 proteins were randomly selected and the shortest paths between them were computed; for each of the 40 proteins, we counted the permutations in which its betweenness on these random shortest paths was higher than its betweenness on the actual shortest paths. This process was repeated 1000 times. The p-value was calculated as the proportion of the 1000 permutations in which this occurred. Shortest-path genes with a p-value below 0.05 were considered significantly related to pancreatic cancer in this study.
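One reading of this permutation test is sketched below: for each candidate protein, the p-value is the fraction of random seed sets in which its betweenness exceeds the observed value. To keep the example short it uses unweighted breadth-first search instead of Dijkstra's algorithm, which is a simplification; the function names, the `pool` argument and the fixed random seed are likewise assumptions of this sketch.

```python
import random
from collections import deque
from itertools import combinations


def bfs_path(adj, source, target):
    """Unweighted shortest path in {node: [neighbors]}; None if unreachable."""
    prev = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nbr in adj.get(node, ()):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    if target not in prev:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def betweenness(adj, seeds):
    """Number of seed-pair shortest paths passing through each inner node."""
    counts = {}
    for a, b in combinations(sorted(seeds), 2):
        path = bfs_path(adj, a, b)
        for node in (path or [])[1:-1]:
            counts[node] = counts.get(node, 0) + 1
    return counts


def permutation_pvalues(adj, observed, pool, n_seeds, n_perm=1000, seed=0):
    """Fraction of random seed sets where a node's betweenness exceeds `observed`.

    observed: dict node -> betweenness on the real shortest paths.
    pool:     list of nodes from which random seed sets are drawn.
    """
    rng = random.Random(seed)
    exceed = {node: 0 for node in observed}
    for _ in range(n_perm):
        random_counts = betweenness(adj, rng.sample(pool, n_seeds))
        for node, obs in observed.items():
            if random_counts.get(node, 0) > obs:
                exceed[node] += 1
    return {node: exceed[node] / n_perm for node in observed}
```

Under this reading, a small p-value means that random seed sets rarely route more shortest paths through the protein than the real gene set does.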

Pathway enrichment analysis

We used the functional annotation tool DAVID [45] for KEGG pathway enrichment and GO functional enrichment analysis. Significant functional modules were selected with a corrected p-value < 0.05.
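The idea behind an over-representation test can be shown with a plain hypergeometric tail probability. This is a simplified stand-in, not DAVID's own statistic (DAVID reports a modified Fisher exact p-value), and the function name and argument names are assumptions of this sketch; no multiple-testing correction is applied here.

```python
from math import comb


def enrichment_pvalue(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 5 annotated, 5 drawn, all 5 annotated.
print(enrichment_pvalue(10, 5, 5, 5))
```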

CONCLUSIONS

In this study, we performed a minimum-redundancy-maximum-relevance (mRMR) based transcriptional profile study to present a comprehensive view of the features of pancreatic cancer. We identified NHS, SCML2, LAMC2, S100P, COL17A1, AMIGO2, PTPRR, KPNA7 and KCNN4 as genes closely related to the disease. Some of them have been validated in vitro and/or in vivo. Functional analysis of the PPI network showed that RNA polymerase II and growth factor functions are important in this disease. In conclusion, our method provided novel insights into this lethal disease, suggesting several genes and functions that are worth further investigation.

38 in total

1. Identification of maspin and S100P as novel hypomethylation targets in pancreatic cancer using global gene expression profiling.

Authors: Norihiro Sato; Noriyoshi Fukushima; Hiroyuki Matsubayashi; Michael Goggins
Journal: Oncogene Date: 2004-02-26 Impact factor: 9.867

2. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers.

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: J Proteome Res Date: 2006-08 Impact factor: 4.466

3. The role of S100P in the invasion of pancreatic cancer cells is mediated through cytoskeletal changes and regulation of cathepsin D.

Authors: Hannah J Whiteman; Mark E Weeks; Sally E Dowen; Sayka Barry; John F Timms; Nicholas R Lemoine; Tatjana Crnogorac-Jurcevic
Journal: Cancer Res Date: 2007-09-15 Impact factor: 12.701

4. Characterization of gene expression profiles in intraductal papillary-mucinous tumors of the pancreas.

Authors: Benoit Terris; Ekaterina Blaveri; Tatjana Crnogorac-Jurcevic; Melanie Jones; Edoardo Missiaglia; Philippe Ruszniewski; Alain Sauvanet; Nicholas R Lemoine
Journal: Am J Pathol Date: 2002-05 Impact factor: 4.307

5. An intermediate-conductance Ca2+-activated K+ channel is important for secretion in pancreatic duct cells.

Authors: Mikio Hayashi; Jing Wang; Susanne E Hede; Ivana Novak
Journal: Am J Physiol Cell Physiol Date: 2012-05-02 Impact factor: 4.249

6. Cytoplasmic expression of laminin gamma2 chain correlates with postoperative hepatic metastasis and poor prognosis in patients with pancreatic ductal adenocarcinoma.

Authors: Shinichiro Takahashi; Takahiro Hasebe; Tatsuya Oda; Satoshi Sasaki; Taira Kinoshita; Masaru Konishi; Takenori Ochiai; Atsushi Ochiai
Journal: Cancer Date: 2002-03-15 Impact factor: 6.860

7. Blockage of intermediate-conductance Ca2+-activated K+ channels inhibit human pancreatic cancer cell growth in vitro.

Authors: Heike Jäger; Tobias Dreker; Anita Buck; Klaudia Giehl; Thomas Gress; Stephan Grissmer
Journal: Mol Pharmacol Date: 2004-03 Impact factor: 4.436

8. Gene expression analysis for predicting gemcitabine sensitivity in pancreatic cancer patients.

Authors: Jianfeng Bai; Naohiro Sata; Hideo Nagai
Journal: HPB (Oxford) Date: 2007 Impact factor: 3.647

9. Cytoscape 2.8: new features for data integration and network visualization.

Authors: Michael E Smoot; Keiichiro Ono; Johannes Ruscheinski; Peng-Liang Wang; Trey Ideker
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

10. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691
