Qiang Wang1, Fengling Wang1, Jiajia Lv1,2, Junfang Xin1, Longxiang Xie1, Wan Zhu3, Yitai Tang4, Yongqiang Li1, Xiaofang Zhao5, Yunlong Wang6, Xianzhe Li2, Xiangqian Guo1. 1. Department of Preventive Medicine, Institute of Biomedical Informatics, Bioinformatics Center, Cell Signal Transduction Laboratory, School of Software, School of Basic Medical Sciences, Henan University, Kaifeng, Henan 475004, P.R. China. 2. Department of Thoracic Surgery, The Affiliated Nanshi Hospital of Henan University, Nanyang, Henan 473003, P.R. China. 3. Department of Anesthesia, Stanford University, Stanford, CA 94305, USA. 4. Department of Pathology, Stanford University, Stanford, CA 94305, USA. 5. Jiangsu SuperBio Co., Ltd., Nanjing, Jiangsu 210061, P.R. China. 6. Henan Bioengineering Research Center, Zhengzhou, Henan 450046, P.R. China.
Abstract
Esophageal squamous cell carcinoma (ESCC) is one of the most common types of cancer worldwide. However, operative diagnostic and prognostic systems for ESCC remain to be established. To improve assessment of the prognosis for patients with ESCC, the present study developed an online consensus survival tool for ESCC, termed OSescc. OSescc was built using 264 ESCC cases with gene expression data and relevant clinical information obtained from the Gene Expression Omnibus and The Cancer Genome Atlas databases. OSescc generates Kaplan-Meier survival plots with hazard ratios and P-values to assess the association between potential biomarkers and relapse-free survival and overall survival. In addition, the current study integrated a function by which the prognosis can be assessed based on an individual probe or on the mean value of multiple probes for each gene, which helps improve the evaluation of the validity and reliability of potential prognostic biomarkers. OSescc can be accessed at bioinfo.henu.edu.cn/DBList.jsp.
Esophageal cancer (EC) was the tenth most common cause of cancer-associated mortality in the United States in 2018 (1), and consists of two histological types: esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (2). ESCC has been the more common histological type of EC for centuries, particularly in Eastern Asia, and accounted for ~90% of all cases of EC worldwide in 2012 (3). In addition, ESCC was the fourth most fatal type of cancer in China in 2015 (4). A common treatment method for ESCC is esophagectomy, followed by radiation or neoadjuvant chemotherapy for the residual tumor tissues (5–7). However, these treatment regimens have failed to substantially improve the survival of patients with ESCC over the last few decades (8). Despite the rapid and successful identification of prognostic biomarkers by genome-wide gene expression profiling in numerous types of cancer, including breast cancer (9–11), lung adenocarcinoma (12), non-small cell lung cancer (13) and rectal adenocarcinoma (14), only a small number of biomarkers for ESCC have been identified, including cortactin (15) and ribonuclease 3 (RNASEN; also known as drosha ribonuclease III) (16).
The intra- and inter-patient molecular heterogeneity of cancer has been repeatedly reported (17–25), and makes it difficult to identify and establish universal prognostic biomarkers. To assist in the development of novel prognostic biomarkers, a number of prognostic databases have been established for breast, ovarian, lung and gastric cancer, and numerous new prognostic molecular markers have been identified (26–29). However, to the best of our knowledge, there is no relevant prognostic database for ESCC.
To facilitate the validation of prognostic biomarkers of ESCC and the evaluation of their sensitivity in independent studies, and to help clinicians and scientists improve their understanding of clinically relevant gene expression data, the present study developed OSescc, a web-based, interactive online consensus survival tool for ESCC. OSescc enables users to assess overall survival (OS) and relapse-free survival (RFS) based on the expression of given genes or probes. OSescc is accessible at bioinfo.henu.edu.cn/DBList.jsp.
Materials and methods
Data acquisition
Datasets were obtained from the Gene Expression Omnibus (GEO; ncbi.nlm.nih.gov/geo/) and The Cancer Genome Atlas (TCGA; cancergenome.nih.gov) by searching with the keyword ‘esophageal carcinoma’, followed by manual screening for datasets with ‘more than 20 ESCC cases with clinical survival data and gene expression profiling data’.
Development of OSescc
The OSescc server is hosted on a Tomcat server running on Windows. Server-side scripts were developed in Java v8 (https://www.oracle.com/technetwork/java/javaee/overview/index.html) and control the analysis requests and the return of the results. The database is managed by Microsoft Structured Query Language (SQL) Server 2008 (https://www.microsoft.com/de-de/download/details.aspx?id=22985), which stores the gene expression and clinical data. The RODBC package serves as the connecting layer between R v3.5.2 (https://www.r-project.org/) and the SQL Server database, and the Java Database Connectivity (JDBC) driver serves as the connecting layer between Java and the SQL Server database. The R package ‘survival’ was used to generate Kaplan-Meier (KM) survival curves and to calculate the hazard ratio (HR) and log-rank P-value. The central server for OSescc can be accessed at bioinfo.henu.edu.cn/DBList.jsp. A system architecture diagram is presented in Fig. 1. Screenshots of the database interface and the result retrieval are presented in Fig. 2.
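To make the two statistics named above concrete, the following is a minimal sketch of a Kaplan-Meier estimator and a two-group log-rank test. It is written in Python with the standard library only and is not the OSescc implementation, which relies on the R package ‘survival’; the function names `kaplan_meier` and `logrank_p` are hypothetical and introduced here for illustration.

```python
import math
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (event time, survival probability) steps of the Kaplan-Meier curve.

    events[i] is 1 if the event (death or relapse) was observed, 0 if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


def logrank_p(times_a: Sequence[float], events_a: Sequence[int],
              times_b: Sequence[float], events_b: Sequence[int]) -> float:
    """Two-group log-rank test P-value (chi-square with 1 degree of freedom)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    in_a = [True] * len(times_a) + [False] * len(times_b)
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for u in times if u >= t)                        # at risk, both groups
        n_a = sum(1 for u, a in zip(times, in_a) if u >= t and a)  # at risk, group A
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d_a = sum(1 for u, e, a in zip(times, events, in_a) if u == t and e == 1 and a)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))
```

In this sketch, group A and group B would correspond to the high- and low-expression groups of a queried gene or probe. The HR reported by OSescc is not computed here.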
Figure 1.
Diagram of the system architecture. JSP, Java Server Pages; JDBC, Java Database Connectivity; SQL, Structured Query Language; RODBC, R Open Database Connectivity.
Figure 2.
Screenshot of the database interface. HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma; TCGA, The Cancer Genome Atlas; CSF1, colony stimulating factor 1; OS, overall survival.
Results
Two datasets that met the criteria were obtained from the GEO (GSE53625) (30) and TCGA (31) databases (Table I). Together, these datasets comprised 264 unique ESCC cases with comprehensive clinical information, including smoking history, alcohol history, TNM stage (32), lymph node status, country and ethnicity.
Table I.
Clinical characteristics of the patients with esophageal squamous cell carcinoma collected in OSescc.
Characteristic              GSE53625        TCGA
Cohort size, n              179             85
Alcohol drinker, n (%)
  Yes                       106 (59.22)     61 (71.76)
  No                        73 (40.78)      23 (27.06)
  NA                        0 (0.00)        1 (1.18)
Smoker, n (%)
  Yes                       114 (63.69)     56 (65.88)
  No                        65 (36.31)      27 (31.76)
  NA                        0 (0.00)        2 (2.36)
TNM stage, n (%)
  I                         10 (5.59)       1 (1.18)
  II                        77 (43.01)      24 (28.24)
  III                       92 (51.40)      13 (15.29)
  IV                        0 (0.00)        4 (4.70)
  NA                        0 (0.00)        43 (50.59)
Lymph node status, n (%)
  Yes                       0 (0.00)        20 (23.53)
  No                        0 (0.00)        34 (40.00)
  NA                        179 (100.00)    31 (36.47)
Ethnicity, n (%)
  Caucasian                 0 (0.00)        40 (47.06)
  Asian                     179 (100.00)    40 (47.06)
  African-American          0 (0.00)        2 (2.35)
  NA                        0 (0.00)        3 (3.53)
TCGA, The Cancer Genome Atlas; NA, not available; OSescc, online consensus survival tool for esophageal squamous cell carcinoma.
In OSescc, users can enter a gene symbol into the ‘Gene symbol’ dialog window (as presented in Fig. 2) to assess the prognostic significance of genes or probes of interest in ESCC. The resulting KM plot displays the association between the queried gene and OS or RFS, with the samples stratified by the expression level of the selected gene or probe according to the user's choice of cutoff (median, quartile or 30%). The results can be retrieved in a cohort-specific manner or, to increase the statistical power, in a cohort-combined manner, and the HR and log-rank test P-value are reported. In addition, the population can be co-stratified by alcohol, smoking, TNM stage, lymph node status, country and ethnicity prior to running the analysis in order to investigate the subclass-specific prognostic power.
Clinicians may also be interested in the association between survival and the expression of an individual prognosis-associated probe or transcript variant. Therefore, the present study developed a function by which OSescc can generate a KM plot not only in a gene-specific manner, using the mean expression value of multiple probes for a given gene, but also in a probe-specific manner, using the expression of a single probe for a given gene, on the GSE53625 dataset. This function enables users to compare the outcomes from a single probe with those from multiple probes, which in turn improves the evaluation of the validity and reliability of potential prognostic biomarkers. Using this function, the current study further analyzed four groups of genes.
The first group contained 79 genes; each gene in this group had multiple probes, and all probes predicted the prognosis with statistical significance and in the same direction, towards either a good or a poor prognosis. Therefore, genes in this group could serve as prognostic biomarkers with a high degree of predictive power.
The second group contained only two genes. For each of these genes, the probes contradicted each other with regard to the predicted prognosis and had HRs in opposite directions. For example, the ERCC excision repair 5 (ERCC5) gene had two probes (probe nos. 55483 and 152357) in GSE53625. A high expression of probe no. 55483 predicted a significantly worse overall survival rate (HR, 1.9514; P=0.0013), whereas a high expression of probe no. 152375 indicated a significantly improved overall survival rate (HR, 0.5545; P=0.0178). Due to the contradiction between the two probes for the ERCC5 gene, the prediction based on the mean expression value of the two probes was not significant, which decreased the sensitivity of the gene-level outcome prediction (Fig. 3).
Figure 3.
Kaplan-Meier plots for the expression of each ERCC5 probe and the mean value expression. A high expression of probe no. 55483 predicted a significantly worse overall survival rate, whereas a high expression of probe no. 152375 indicated a significantly improved overall survival rate. (A) The mean value expression of ERCC5 detected by the two probes. (B) ERCC5 probe no. 55483. (C) ERCC5 probe no. 152357. Esophageal squamous cell carcinoma cases, n=179. ERCC5, ERCC excision repair 5; HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma.
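The difference between a probe-specific and a gene-specific analysis can be illustrated with a short sketch. The Python code below averages the probes of one gene sample by sample and then splits the cohort at the median; it is an illustration under stated assumptions rather than the OSescc code. The function names, the probe labels `probe_a` and `probe_b`, and the expression values are hypothetical, and only the median cutoff is shown because the text does not specify how the quartile and 30% cutoffs define the groups.

```python
from statistics import mean, median
from typing import Dict, List, Sequence


def gene_level_expression(probes: Dict[str, Sequence[float]]) -> List[float]:
    """Average the probes of one gene sample by sample (gene-specific mode)."""
    return [mean(values) for values in zip(*probes.values())]


def split_high_low(expression: Sequence[float]) -> List[str]:
    """Label a sample 'high' if its expression exceeds the cohort median, else 'low'."""
    cutoff = median(expression)
    return ["high" if x > cutoff else "low" for x in expression]


# Hypothetical expression values for four samples and two probes of one gene.
probes = {
    "probe_a": [1.0, 2.0, 3.0, 4.0],
    "probe_b": [4.0, 3.0, 2.0, 1.0],
}

per_probe_groups = {name: split_high_low(values) for name, values in probes.items()}
gene_groups = split_high_low(gene_level_expression(probes))
```

With these made-up values the two probes rank the samples in opposite order, so each probe alone separates the cohort into two groups, whereas the averaged expression is identical for every sample and no sample exceeds the median. This is the mechanism by which contradictory probes can mask a signal in the gene-specific mode.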
The third group included 532 genes, each of which had multiple probes. Notably, for each gene, although ≥1 probe offered a statistically significant prognosis prediction, the prediction based on the mean expression value of all probes was not significant. An example is the colony stimulating factor 1 (CSF1) gene, which was detected using three probes (probe nos. 19819, 54057 and 116310), with log-rank P-values of 0.7295, 0.2978 and 0.0031, respectively. Whereas probe no. 116310 appeared to be a promising prognosis predictor (P=0.0031), the log-rank P-value for the mean expression value of the three probes was 0.9823, which indicates that the gene-level prognosis prediction was not significant (Fig. 4).
Figure 4.
Kaplan-Meier plots for the expression of each CSF1 probe and the mean value expression of three probes. Probe no. 116310 demonstrated a statistically significant prognosis prediction; however, the other probes and the mean value expression of all probes were not significant. (A) The mean value expression detected using three probes. (B) Probe no. 19819. (C) Probe no. 54057. (D) Probe no. 116310. Esophageal squamous cell carcinoma cases, n=179. CSF1, colony stimulating factor 1; HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma.
The last group included 135 genes, all with multiple probes. Notably, for each gene, although none of the probes predicted the prognosis with statistical significance, the prognosis prediction based on the mean value of all probes was significant. In other words, for genes in this group, the prognosis prediction was statistically significant in a gene-specific manner, but not in a probe-specific manner. For example, the β-1,3-N-acetylglucosaminyltransferase 7 (B3GNT7) gene has two probes (probe nos. 65517 and 69543). The log-rank P-values of the two probes were 0.64075 and 0.76305, respectively. However, the log-rank P-value for the mean expression value of the two probes was 0.0081 (Fig. 5).
Figure 5.
Kaplan-Meier plots for the expression of each B3GNT7 probe and the mean value expression. Neither probe no. 65517 nor probe no. 69543 for B3GNT7 predicted the prognosis with statistical significance; however, the prognosis prediction based on the mean value of all probes was significant. (A) The mean value expression of B3GNT7 detected using two probes. (B) B3GNT7 probe no. 65517. (C) B3GNT7 probe no. 69543. Esophageal squamous cell carcinoma cases, n=179. B3GNT7, β-1,3-N-acetylglucosaminyltransferase 7; HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma.
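The four groups described above can be summarized as a decision rule over the per-probe P-values, the per-probe HRs and the P-value of the mean expression. The Python sketch below is one possible reading of that rule, not the procedure used in the study: the function name `classify_gene` is hypothetical, the significance threshold of 0.05 is an assumption (the text does not state one), and the order in which the conditions are checked is a design choice made here so that contradictory probes take precedence.

```python
from typing import Optional, Sequence


def classify_gene(probe_p: Sequence[float], probe_hr: Sequence[float],
                  mean_p: float, alpha: float = 0.05) -> Optional[int]:
    """Assign a multi-probe gene to one of the four groups, or None if none fits.

    alpha is an assumed significance threshold; it is not stated in the text.
    """
    if len(probe_p) < 2 or len(probe_p) != len(probe_hr):
        return None  # the groups are only defined for genes with multiple probes
    # Direction (HR > 1 means worse prognosis) of every significant probe.
    directions = {hr > 1.0 for p, hr in zip(probe_p, probe_hr) if p < alpha}
    n_significant = sum(1 for p in probe_p if p < alpha)
    if len(directions) == 2:
        return 2  # significant probes that contradict each other
    if n_significant == len(probe_p):
        return 1  # all probes significant, same direction
    if n_significant >= 1 and mean_p >= alpha:
        return 3  # some probe significant, mean expression not significant
    if n_significant == 0 and mean_p < alpha:
        return 4  # no probe significant, mean expression significant
    return None


# P-values and the ERCC5 HRs are taken from the text; values marked
# "hypothetical" are placeholders because the text does not report them.
ercc5 = classify_gene([0.0013, 0.0178], [1.9514, 0.5545], mean_p=0.5)  # mean_p hypothetical
csf1 = classify_gene([0.7295, 0.2978, 0.0031], [1.1, 1.2, 1.8], mean_p=0.9823)  # HRs hypothetical
b3gnt7 = classify_gene([0.64075, 0.76305], [1.1, 1.1], mean_p=0.0081)  # HRs hypothetical
```

Under these assumptions, the three worked examples from the text fall into the second, third and fourth groups, respectively.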
Previous studies have reported several potential biomarkers for ESCC prognosis, including ABL proto-oncogene 1, non-receptor tyrosine kinase (ABL1) (33), transcription elongation factor A protein like 1 (TCEAL1) (34), tropomyosin 2 (35), cystatin C (36), REV3 like, DNA directed polymerase ζ catalytic subunit (37) and urokinase-type plasminogen activator (38). In OSescc, as expected, ABL1 and TCEAL1 demonstrated significant prognostic prediction in both the GEO and TCGA datasets (Fig. 6). Furthermore, OSescc could detect differences in the performance of a prognostic biomarker across cohorts. For example, RNASEN, which has previously been reported as a biomarker of worse prognosis in patients with ESCC (16), was also associated with worse prognosis by OSescc in the TCGA dataset; however, the prognosis analysis based on the GSE53625 dataset was not statistically significant (Fig. 7).
Figure 6.
Kaplan-Meier plots for two previously published esophageal squamous cell carcinoma biomarkers cross-validated between TCGA and GSE53625 datasets. (A) Prognostic significance of gene ABL1 in GSE53625. (B) Prognostic significance of gene ABL1 in TCGA. (C) Prognostic significance of gene TCEAL1 in GSE53625. (D) Prognostic significance of gene TCEAL1 in TCGA. ESCC cases in TCGA, n=85 and in GSE53625, n=179. TCGA, The Cancer Genome Atlas; ABL1, ABL proto-oncogene 1, non-receptor tyrosine kinase; TCEAL1, transcription elongation factor A protein like 1; HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma; ESCC, esophageal squamous cell carcinoma.
Figure 7.
Kaplan-Meier plots for gene RNASEN reveal different prognostic significance between TCGA and GSE53625 datasets. (A) Prognostic significance of RNASEN in TCGA. (B) Prognostic significance of RNASEN in GSE53625. ESCC cases in TCGA, n=85 and in GSE53625, n=179. RNASEN, ribonuclease 3; TCGA, The Cancer Genome Atlas; HR, hazard ratio; CI, confidence interval; OSescc, online consensus survival tool for esophageal squamous cell carcinoma; ESCC, esophageal squamous cell carcinoma.
Discussion
Discovering prognostic biomarkers is important for translational cancer research. The present study analyzed two large gene expression profiling datasets of patients with ESCC with comprehensive clinical follow-up information from GEO and TCGA, and developed OSescc, a free online tool, to assess the prognostic power of potential prognostic biomarkers.
The advantage of OSescc is the convenience with which potential prognostic biomarkers for ESCC can be assessed and validated. It can help clinicians and biologists determine the prognostic power of given genes in an easy and interactive way. The main limitation of the current study is the sample size; only two available cohorts were included in OSescc. This is one reason why the OSescc database was designed to collect all available ESCC clinical follow-up information and gene expression profiling data to facilitate prognosis analysis by other researchers. When novel ESCC clinical and profiling data become available, they will be added to the OSescc database to improve its power and reliability. OSescc was developed to assist researchers and clinicians in investigating the prognostic value of genes of interest specifically in patients with ESCC. So far, the clinical and gene expression data collected from GEO and TCGA contain only patients with ESCC. In addition to OSescc, we are developing online survival tools for other types of cancer, which should be publicly accessible in the future.
Furthermore, the prognosis of patients with ESCC has been demonstrated to be highly dependent on patient-related factors, including immune function, nutrition, alcohol drinking and smoking status (39,40); however, the aim of the current study was to evaluate the association between gene expression and prognosis. Nevertheless, certain factors, including TNM stage, smoking, alcohol, lymph node status, country and ethnicity, were also implemented in OSescc so that users can restrict their prognostic analysis to particular groups of patients with ESCC. In summary, OSescc may become a guidance tool for selecting suitable prognostic markers for ESCC.