Assessing the clinical utility of genomic expression data across human cancers.

Xinsen Xu¹, Lei Huang², Chun Hei Chan³, Tao Yu⁴, Runchen Miao¹, Chang Liu¹.

Abstract

Cancer molecular profiling provides a better understanding of tumor mechanisms and helps to improve existing cancer management. Here we present the gene expression signatures from ~9000 human tumors with clinical information across 32 malignancies from The Cancer Genome Atlas (TCGA) project. Major predictors from the RNA sequencing data that were significantly correlated with cancer survival were identified. The expression levels of these prognostic genes revealed significant genomic pathways that were clinically relevant to survival outcomes across human cancers. Furthermore, we show that in most cancer types, combining these genomic signatures with clinical information might yield improved predictions. Thus, with respect to clinical utility, our study reveals the promising value of genomic data from the pan-cancer perspective.

Keywords: cancer; expression; genome; prognosis; utility

Substances:
Biomarkers, Tumor

Year: 2016 PMID: 27322207 PMCID: PMC5216771 DOI: 10.18632/oncotarget.10002

Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553

INTRODUCTION

Cancer is a global health burden and the second leading cause of death [1]. Despite various detection methods and treatment options, survival rates for most cancers are still very low. Genomic features have emerged as promising biomarkers for cancer [2]. Among the various molecular data, gene expression values from RNA sequencing reveal detailed molecular features with prognostic associations [3]. However, due to high cost, most studies focused only on several pre-selected genes or were based on small sample sizes [4, 5]. The Cancer Genome Atlas (TCGA) project “motivated large-scale genomic efforts to obtain the complete catalogs of the genomic alterations in cancer” [6]. Besides the rich molecular features (genomic, transcriptomic, epigenomic and proteomic) of each tumor, it also provides valuable clinical information. However, the clinical utility of these data has not been fully elucidated. In the present study, we depicted the global pan-cancer prognostic landscape by analyzing the expression signatures from ~9000 human tumors across 32 malignancies from TCGA data sets. Furthermore, the clinical utility of survival predictions was evaluated by combining the genomic data with clinical information.

RESULTS

Patient characteristics and outcome

Information on patients with complete RNA sequencing data and clinical data was collected for all TCGA cancer types (32 tumor types): adrenocortical carcinoma, ACC; bladder urothelial carcinoma, BLCA; breast invasive carcinoma, BRCA; cervical and endocervical cancers, CESC; cholangiocarcinoma, CHOL; colon adenocarcinoma, COAD; lymphoid neoplasm diffuse large B-cell lymphoma, DLBC; esophageal carcinoma, ESCA; glioma, GBMLGG; head and neck squamous cell carcinoma, HNSC; kidney chromophobe, KICH; kidney renal clear cell carcinoma, KIRC; kidney renal papillary cell carcinoma, KIRP; acute myeloid leukemia, LAML; liver hepatocellular carcinoma, LIHC; lung adenocarcinoma, LUAD; lung squamous cell carcinoma, LUSC; mesothelioma, MESO; ovarian serous cystadenocarcinoma, OV; pancreatic adenocarcinoma, PAAD; pheochromocytoma and paraganglioma, PCPG; prostate adenocarcinoma, PRAD; rectum adenocarcinoma, READ; sarcoma, SARC; skin cutaneous melanoma, SKCM; stomach adenocarcinoma, STAD; testicular germ cell tumors, TGCT; thyroid carcinoma, THCA; thymoma, THYM; uterine corpus endometrial carcinoma, UCEC; uterine carcinosarcoma, UCS; uveal melanoma, UVM. The data analysis steps and the clinical characteristics of the cancer patients are shown in Figure 1A-1E. In total, there were 9175 patients across the 32 tumor types, of whom 49.4% were male and 50.6% were female. Their median age was 60 years. With respect to tumor stage, 30.7% of the patients were in stage 1, 30.9% were in stage 2, 26.7% were in stage 3, and 11.7% were in stage 4. At the time of analysis, 77.4% of the patients remained alive, and 22.6% were deceased.

Figure 1

Overview of the computational approach and patient characteristics

A. Flow diagram summarizing the data processing and analysis steps. B. Number of patient samples with survival data, organized by cancer types. C. Median age of the patients in different cancer types. D. Median survival time of the patients in different cancer types (some of the cancer types don't have enough death events to calculate the median survival times, either because of the high survival rates or due to the small sample size of the cancer type). E. Frequency distributions of gender, tumor stage and survival outcome in the whole cancer population.

Pan-cancer prognostic genes and risk scores

To explore the pan-cancer prognostic signatures, a pan-cancer dataset was built by combining all the cancer patients. Samples were randomly assigned to two groups: 80% of the samples formed the training group and 20% the testing group. Cox regression analysis of the training group identified the top ten adverse prognostic genes (B3GNT5, SLC11A1, ELF4, GALNT2, PA2G4P4, SKP2, S100A9, FOXM1, PSMB2, ARL6IP6) and the top ten favorable prognostic genes (TADA2B, CBX7, CIRBP, MAGED2, CRY2, CREBL2, TMED8, XPC, SECISBP2, GPD1L) (Figure 2A). Based on these top prognostic genes, risk scores were calculated (Table 1). The risk score was defined as the weighted sum of the independent prognostic gene values (1 for high expression, and 0 for low expression), weighted by their regression coefficients from the Cox models (Figure 2B).
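The risk score definition above can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `binary_risk_score`, the patient values and the cohort medians are hypothetical, and only three of the twenty terms of the binary "Whole population" score in Table 1 are used to keep the example short.

```python
# Illustrative only: coefficients copied from three terms of the binary
# "Whole population" score in Table 1; expression values and medians are made up.
COEFFICIENTS = {"B3GNT5": 0.66, "FOXM1": 0.63, "CBX7": -0.69}


def binary_risk_score(expression, medians, coefficients=COEFFICIENTS):
    """Weighted sum of binarized expression (1 if above the gene's median, else 0)."""
    score = 0.0
    for gene, coef in coefficients.items():
        high = 1 if expression[gene] > medians[gene] else 0
        score += coef * high
    return score


medians = {"B3GNT5": 10.0, "FOXM1": 50.0, "CBX7": 30.0}  # hypothetical cohort medians
patient = {"B3GNT5": 12.0, "FOXM1": 80.0, "CBX7": 5.0}   # hypothetical patient
print(round(binary_risk_score(patient, medians), 2))
```

In this toy case the two adverse genes are above their medians and the favorable gene is below its median, so only the two positive coefficients contribute to the score.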

Figure 2

Prognostic landscape of gene expression in the whole cancer population

Table 1

Specific risk scores for different types of cancer

Cancer Type	Risk Score
Whole population (binary)	Score = 0.66B3GNT5+0.65SLC11A1+0.65ELF4+0.65GALNT2+0.63PA2G4P4+0.63SKP2+0.60S100A9+0.63FOXM1+0.61PSMB2+0.64ARL6IP6-0.63GPD1L-0.62SECISBP2-0.58XPC-0.61TMED8-0.61CREBL2-0.64CRY2-0.64MAGED2-0.68CIRBP-0.69CBX7-0.71TADA2B
Whole population (continuous)	Score = 0.09B3GNT5+0.17SLC11A1+0.29ELF4+0.12GALNT2+0.18PA2G4P4+0.16SKP2+0.08S100A9+0.14FOXM1+0.04PSMB3+0.23ARL6IP6-0.34GPD1L-0.35SECISBP2-0.38XPC-0.26TMED8-0.36CREBL2-0.37CRY2-0.46MAGED2-0.54CIRBP-0.48CBX7-0.42TADA2B
ACC	Score = 2.48MASTL+2.38RECQL4+2.33PRC1+2.36KIF11+2.65AMMECR1L+2.55TRIP13+2.54MKI67+2.59NCAPD3+2.07E2F1+2.06FANCI-2.07APH1B-2.3CTSA-2.04UPRT-2.02HNRNPH2-2.28NDRG4-2.47PPFIBP2-1.94LACTB-2.43PTGR2-2.11CHIC1-2.36BDH2
BLCA	Score = 0.81SPNS1+0.77GARS+0.78NBAS+0.77IFT122+0.75NOMO1+0.74TMX2+0.73DHRS4+0.74CCDC28B+0.74TMEM109+0.74DAD1-0.86GATA2-0.77TRIM26-0.76MRPS6-0.77YDJC-0.76ZNF841-0.75ZBTB49-0.72ORMDL1-0.72DEDD2-0.72OGT-0.74CTSH
BRCA	Score = 0.88ZHX1+0.82PRRC1+0.82SCRN1+0.81IARS+0.81PTPN11+0.83VPS35+0.78MRS2+0.77GRPEL2+0.79TMEM65+0.76PGK1-0.95TNFRSF14-0.95KDM4B-0.92INO80B-0.88LOC150776-0.86MRPL23-0.87PYCARD-0.84ABHD14A-0.82FGD3-0.84SEC14L2-0.79NFKBIA
CESC	Score = 1.14PHRF1+1.09TNRC18+1.06ITGA5+0.98DBN1+1.01LATS2+1.01TOR1AIP2+1.02FASN+1URGCP+0.95SRI+0.95ADAM9-1.28TREX1-1.21RBM38-1.07LGALS9-1.04HNRNPA3-0.97NQO2-0.96ZER1-0.97ISCU-0.94MTCP1NB-0.96AKR1A1-0.98SLC25A28
CHOL	Score = 1.91EIF5A+2CEBPB+1.9SCO1+1.89ROM1+2.32SRI+1.69FAM54B+1.62MNAT1+1.58PSEN1+1.51PDHB+1.66SLC38A6-1.63SCRN1-1.95PGPEP1-1.83EIF4ENIF1-1.62SGSH-1.63VSIG10-1.49ACBD5-1.47PURB-1.61TNFAIP8-1.57FUT4-1.38FGD6
COAD	Score = 1.24TIAL1+1.14SMNDC1+1.1KIAA0907+1.09POLR2J4+1.03HSPA1L+0.94ZBTB25+0.95UBN2+0.95SCRN3+0.98ZBTB9+0.93DNAJB6-0.99CPT2-1.01MRPL37-0.99ATP8B1-0.96CCDC149-0.92EIF2C1-0.9DYNLL2-0.96ZCCHC11-0.91MFN2-1.01GSR-0.9SAMM50
DLBC	Score = 1.67ELP4+1.48API5+1.48ARHGEF7+1.48ATXN7L2+1.48EXOC5+1.48GMEB1+1.48MEMO1+1.48MPHOSPH10+1.48MTOR+1.48NEO1-1.48TBKBP1-1.48STXBP2-1.48PUS1-1.48PTRH1-1.48POLR3D-1.48KCNK6-1.48IFI35-1.48GPAA1-1.48FHL3-1.48FBXW5
ESCA	Score = 1.13B3GALTL+1.11PGK1+1.18GRPEL2+1.17MAPRE1+1.03SRXN1+1.02LRRC58+0.99NFATC3+0.96ST13+0.94TRMT6+0.92MLLT11-1.02UNC13D-0.98PCSK7-0.94PLCD3-1.01DIP2A-0.98PLEKHM1P-0.89UNC93B1-0.87ERAP2-0.84LRCH4-0.86CCBL2-0.84C10orf54
GBMLGG	Score = 1.98GLA+1.83KDELC2+2.01WEE1+1.88EMP3+1.8DUSP10+1.84CLIC1+1.88TIMP1+1.84CD58+1.79DDB2+1.81SHISA5-2.01ZRANB1-1.9GLUD1-1.88FAM190B-1.78RAP2A-1.79ADD1-1.77HDAC4-1.83ARL3-1.74PATZ1-1.79SCAPER-1.73RPL7
HNSC	Score = 0.9PGK1+0.8USP10+0.78TOMM34+0.8SNX6+0.72TMED2+0.7PDIA3P+0.69ADK+0.71USP14+0.69TRIM32+0.68HPRT1-0.75ZNF266-0.69ZNF700-0.64AHCYL2-0.65SH3BP2-0.65ZNF577-0.64ZNF557-0.64ATXN7L2-0.64ZNF20-0.63DUSP16-0.63CDK3
KICH	Score = 2.36PNPT1+2.33PTP4A2+2.31GPN1+2.31GPATCH2+2.3PLEKHA2+2.29NRAS+2.27PDS5A+2.27KDM1B+2.27TTF2+2.26NT5DC3-2.36FIZ1-2.34TST-2.34C14orf1-2.34ELAVL1-2.33KLHL26-2.31CES2-2.31CTDP1-2.31SUSD1-2.3USF2-2.3COPS7A
KIRC	Score = 1.28DONSON+1.24STRADA+1.2ATP13A1+1.19NOP56+1.18CARS+1.18ANAPC7+1.16ANAPC5+1.14SBNO2+1.15NCLN+1.18FKBP11-1.16SGCB-1.15PINK1-1.12FBXO3-1.11SSFA2-1.1ITGA6-1.01HBP1-1FBXL3-1.02RNF20-1PURA-0.98FBXL5
KIRP	Score = 1.7GLT25D1+1.48LMNB2+1.54SPAG5+1.96ADA+1.41PUS7+1.61CCNF+1.5RHBDF2+1.7P4HB+1.58TSEN15+1.41AEBP1-1.65TMCO4-1.49PGPEP1-1.63FBXL5-1.51HTATSF1-1.56CCDC71-1.56ACTR8-1.36CC2D2A-1.42PARP3-1.39ZBTB3-1.39SLC25A11
LAML	Score = 1.1TOMM40L+0.95NUP210+0.91PARP3+0.83DDIT4+0.83CLCN5+0.79FIBP+0.78RPS6KA1+0.77PSMA7+0.76RINL+0.76PARVB-0.99PWWP2A-0.97MBTPS1-0.87NHLRC3-0.87LOC646762-0.86ADSS-0.84TGIF1-0.81SIAH1-0.83DET1-0.8KCTD15-0.79FCHSD2
LIHC	Score = 0.97HNRNPH1+0.81N4BP3+0.82LDHA+0.81ZCRB1+0.84YBX1+0.78STK39+0.78ATP6V1E1+0.8ANXA5+0.78HN1+0.76ATP1B3-0.81STAT5B-0.79C9orf3-0.79CHST14-0.76SIK2-0.72POLDIP2-0.73ATF7IP2-0.72SLC23A2-0.67STIM1-0.65MIA3-0.65PSD4
LUAD	Score = 0.72ITGA6+0.73C1QTNF6+0.72MTHFD1+0.7DNAJB4+0.7BACH1+0.69CCNA2+0.65EXT1+0.65FSCN1+0.66DNAJB6+0.65NOC3L-0.91SLC25A42-0.82PRKCD-0.79DBP-0.75DENND1C-0.71NRL-0.72C19orf42-0.73ALAD-0.71SLC11A2-0.68ABAT-0.67FAM117A
LUSC	Score = 0.74CD14+0.66ARHGAP1+0.63CD151+0.62FSTL3+0.6RALGAPA2+0.59CST3+0.57C11orf2+0.56SNX29+0.56FAM109B+0.54EHD1-0.69ERH-0.65NDUFB1-0.59CBX1-0.56EMD-0.55RLIM-0.53FAM103A1-0.53MNAT1-0.53VRK1-0.51SS18L2-0.5FKBP3
MESO	Score = 1.71CDCA8+1.63KPNA2+1.62SPAG5+1.54CCNA2+1.64IQGAP3+1.66FOXM1+1.5HMGB2+1.51MAD2L1+1.52CDCA5+1.58PRC1-1.53KLHL9-1.44ETAA1-1.41THTPA-1.36HIST1H2BD-1.32FOXO4-1.39FBXO44-1.28HIST1H2AC-1.39HIST1H2BK-1.3SH3BGRL-1.36TMBIM4
OV	Score = 0.6CBLL1+0.59CACNA1C+0.56SOCS5+0.54ZNF384+0.54CACNB1+0.53SEMA4F+0.52AGPAT6+0.52CHKA+0.54GLIS2+0.52GLCE-0.77NPEPL1-0.6TLCD1-0.57LMO4-0.55CASP6-0.54ISG20-0.55AP4B1-0.53SAT1-0.52ZNF326-0.51ENSA-0.5AP1S2
PAAD	Score = 1.31ATG12+1.3ASCC1+1.33NFE2L3+1.31KIAA1609+1.3CCDC6+1.2EIF2A+1.26TMOD3+1.21AP3S1+1.24METAP1+1.22NCK1-1.33USP20-1.27MUM1-1.27REC8-1.24RBM6-1.21ARMC5-1.23DEF8-1.27KLHL22-1.13C7orf43-1.14MGC23284-1.1ELMOD3
PCPG	Score = 2GLE1+1.99EFTUD1+1.99NARG2+1.98CIZ1+1.97ZNF490+1.97TTC9C+1.96FAM178A+1.96ABCA1+1.95AKAP13+1.95LOC642852-2.03HMOX2-1.96DGCR14-1.96SLC10A3-1.95ITFG3-1.94FAM118A-1.93MBD3-1.93USE1-1.92ICOSLG-1.91FSCN1-1.91TMEM167B
PRAD	Score = 20.3EXTL2+20.3B3GNT5+20.3SEMA4C+20.3NUDCD2+20.3GNAI1+20.3THUMPD1+20.3CNNM3+20.3RNF138+20.3PRPF4+20.3FASTKD3-20.3MRM1-20.3DAP-20.3PAOX-20.3PLA2G15-20.3SBNO2-20.3STK19-20.3CCDC85C-20.3TBXAS1-20.3NFATC1-20.3HSD17B7
READ	Score = 2.26PSMA3+2.84PHLPP1+2.91CNDP2+2.96CORO1A+2.2AKR7A2+2.82SSBP2+2.82TMEM173+2.7ATP6V0C+2.7NFYC+2.79B4GALT3-2.28OSGEPL1-2.96PHF20-3.01ANKRD27-2.95ZNF853-2.95RAPGEF2-2.88SETD2-2.87MSH6-2.85ATM-3.01SIRT5-2.81SGK3
SARC	Score = 1.12RLIM+1.03BAIAP3+0.97FUBP1+1ZNF146+0.99ATXN10+0.95LRRC41+0.93LRRC47+0.9DOCK7+0.9ZNF697+0.89LAPTM4B-1.13TRIM21-1.08B3GALT4-1.04CCDC69-1.01CCNDBP1-0.97C14orf159-0.95GALK2-0.91PARP14-0.91ATP2A3-0.86C15orf24-0.84PPAP2A
SKCM	Score = 0.75HN1L+0.7GATAD2A+0.68NT5DC2+0.66VDAC1+0.65KPNA2+0.62FOXM1+0.62DCTN2+0.61CDC25A+0.6SLC25A3+0.61SLC25A15-0.81GBP2-0.77APOL6-0.77IFITM1-0.75FCGR2A-0.74FAM96A-0.72PARP9-0.72APOBEC3F-0.71NXT2-0.7UBA7-0.7APOL1
STAD	Score = 2.33SLC9A3R2+1.91ITPRIP+1.74SOCS2+2.04C1orf144+1.7LOC282997+1.69BMP2K+2.64VPS52+1.93UBE4B+1.51CXCR7+1.9NDUFA11-2.55SLC33A1-2.27TMEM66-2.18UBA5-2.14CD47-1.8C21orf59-2.03NSF-2.01FUNDC1-1.97RAB1A-1.69C14orf142-1.95PFDN4
TGCT	Score = 1.03FAM177A1+1.03NBR2+1ATAD2B+0.98C8orf73+0.96FMNL2+0.95CEBPA+0.95VCPIP1+0.94C12orf23+0.94LMBR1L+0.94ABCC5-1.06MYO1E-1.03CABLES1-1FAM84B-0.99TOP1-0.98NCSTN-0.97NAIF1-0.97IRS2-0.97HIBADH-0.97FUBP3-0.97PGM1
THCA	Score = 2.06IQSEC1+1.99FLYWCH1+1.88ZHX3+2.12SEMA6A+1.78FTO+1.76LARS+1.74TGFBR3+2.03PTEN+1.72ZNF324+2.72CEP250-1.96ANXA1-1.89SEC14L2-2.18CIR1-2.17MED17-2.15ITGB1BP1-1.86SRP68-2.14VAMP8-2.08PSME2-2.77RPS27-1.73CLU
THYM	Score = 2.5RARG+2.45RBM47+2.39PELI3+2.39ATP1B1+2.39TST+2.35NUDT16+2.38DENND1A+2.35PPAPDC1B+2.34GNS+2.3TBC1D16-2.44ADRBK1-2.43PDSS1-2.41SEMA4D-2.4INTS8-2.4VRK1-2.4PTP4A2-2.39CUTC-2.39SEMA7A-2.38SCLT1-2.37ANKRD27
UCEC	Score = 2.12TUBB2A+1.92TAOK3+2.11ENDOD1+2.05KLF11+2.06SYNPO+2.02BRAF+2.02SYTL2+2.05SPAG5+1.72MCL1+1.73ARMC1-1.81SETD6-1.8LYRM1-1.58PYCRL-1.58YDJC-1.71CRBN-1.94C15orf29-1.52PHF5A-2.57PPA1-1.56WWOX-1.64IFT140
UCS	Score = 1.27S100A10+1.23PDE4A+1.21STMN3+1.22ARL4D+1.18HIBCH+1.16FN3K+1.2SEC23B+1.12NINJ1+1.16LOC728554+1.16CTU1-1.86CBX5-1.32DNMT3A-1.31PSMD7-1.35PCBP2-1.25C2orf68-1.18BUD13-1.17ZNRF1-1.21SSRP1-1.19ST3GAL2-1.22TUT1
UVM	Score = 2.32GTF3A+3.14PSTPIP2+2.27SPAG1+3.03SFT2D2+2.23LIPA+2.2IMPA1+2.21JTB+2.16COQ2+2.93ALG5+2.97ISG20-3.14RABL2B-2.26C16orf86-2.19CNP-3.01C3orf39-2.19C3orf37-2.17TBKBP1-2.14TOM1L2-2.17RPL32P3-2.89PPP2R3B-2.16QRICH1

A. Top ten adverse and favorable pan-cancer prognostic genes were identified in the training group, ranked by the z scores. B. Risk score calculated by the top prognostic genes in the training group patients. Upper panel: risk-score distribution of the training group patients and survival status (blue indicates alive, and red indicates dead). Lower panel: heatmap showing the expression level of the top prognostic genes. C. Box plots of risk scores in different age groups, different gender groups, and different stage groups in the training group patients. D. Forest plot of risk score association with cancer mortality in the training group patients of different stages. E. Kaplan-Meier estimates of overall survival according to the risk score in the training set. F. Box plots of risk scores in different age groups, different gender groups, and different stage groups in the testing group patients. G. Forest plot of risk score association with cancer mortality in the testing group patients of different stages. H. Kaplan-Meier estimates of overall survival according to the risk score in the testing set.

To assess the clinical utility of the risk score, its correlation with the clinical variables in the training group was explored. In this analysis, higher scores were associated with male patients, older patients, and patients with advanced tumor stages (Figure 2C). Further Cox analysis and the log-rank test also confirmed poorer survival outcomes in patients with higher risk scores across tumor stages (Figure 2D–2E). In the testing group, a similar relationship between the risk score and the clinical variables was observed (Figure 2F). Notably, in the survival analysis, higher risk scores in the testing group also indicated a poorer prognosis, suggesting that the risk score has valuable clinical utility (Figure 2G–2H).

Evaluation of the prognostic genes and risk scores

The RNA-seq data demonstrated great value for cancer prognosis. Risk scores specific to each cancer type, calculated with the method used for the whole cancer population, are shown in Table 1. At present, however, the clinical value of these cancer-specific prognostic models is limited by their sample sizes. To evaluate the effect of different normalization methods for the RNA-seq data, the quantile-normalized data were either transformed into z-scores or normalized with voom. As shown in supplementary Figure 1A-1B, neither the z-score transformation nor the voom normalization changed the prognostic genes much (based on the z values from Cox regression), with Pearson r values of 0.98 and 0.97, respectively. The top prognostic genes remained mostly the same under each normalization method (supplementary Figure 1C). Thus, applying the z-score transformation or voom normalization added little value to the survival analysis.

In the prognostic model above, high or low expression was determined by the median expression level of each gene: each gene was coded as a binary variable, 1 for high expression (> median value) and 0 for low expression (< median value). We also applied the z-score directly, as a continuous variable, to propose a prognostic model that reflects the gene expression values themselves (Table 1). Cox regression showed that the continuous prognostic model had a hazard ratio of 1.22, which means that the risk of death increases by 22% for each 1-unit increase in the risk score. The prognostic genes (in each cancer type and in the whole cancer population) were filtered by a specific cutoff (|z| > 3.09, or nominal one-sided p < 0.001). To investigate the relationship between prognostic genes across cancer types, the prognostic genes in each cancer type were compared with those in the whole cancer population. As shown in supplementary Figure 1D, most of the prognostic genes identified in a single cancer type were also found in the pooled analysis. For example, 66% of the prognostic genes in ACC also had prognostic value in the whole cancer population.
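Two numbers in this passage can be checked with a short Python sketch: the correspondence between the cutoff |z| > 3.09 and a nominal one-sided p of about 0.001, and the reading of a hazard ratio of 1.22 as a 22% increase in hazard per unit of risk score. The helper names `one_sided_p` and `relative_hazard` are hypothetical, and the sketch assumes a standard normal reference distribution and the usual multiplicative Cox interpretation; it is not taken from the authors' analysis.

```python
from statistics import NormalDist


def one_sided_p(z):
    """Upper-tail p-value of a standard normal z statistic."""
    return 1.0 - NormalDist().cdf(z)


def relative_hazard(hazard_ratio_per_unit, score_increase):
    """Multiplicative change in hazard for a given increase in a continuous score."""
    return hazard_ratio_per_unit ** score_increase


print(one_sided_p(3.09))         # close to 0.001
print(relative_hazard(1.22, 1))  # 22% higher hazard for a 1-unit increase
print(relative_hazard(1.22, 2))  # hazards multiply, so 1.22 squared for 2 units
```

Because the Cox model is multiplicative in the hazard, a 2-unit increase in the score corresponds to a factor of 1.22 squared (about 1.49), not to 44%.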

Pathway analysis in patients with different prognosis

Based on the prognostic risk score, patients were stratified into two survival groups: those with a positive risk score and those with a negative risk score. Unsupervised cluster analysis showed obvious distinctions between the stratified survival groups, in both the training group and the testing group (Figure 3A, 3E). To link the observed gene expression changes with molecular pathways that may underlie the differential survival between high- and low-risk groups, gene set enrichment analysis (GSEA) was performed. As shown in Figure 3B and 3F, pathways such as E2F targets, MYC targets, G2M checkpoint, mTORC1 signaling and interferon gamma response were significantly enriched in patients with higher risk scores, with good consistency between the training group and the testing group.

Figure 3

Prognostic landscape of pathway scores in the whole cancer population

A, E. Heatmaps depicting gene expression levels after unsupervised hierarchical clustering in the training set and testing set, respectively. Expression levels are indicated on a low-to-high scale (green-black-red). Two clusters are defined, namely the high risk group and low risk group. B, F. GSEA analysis was performed in the training set and testing set, respectively, to identify biological pathways associated with survival outcome. FWER-p values are indicated on a low-to-high scale (lightblue-darkblue). The number of significant genes in each gene set is indicated by the circle size. C, G. Forest plots of pathway score association with cancer mortality in the training set and testing set, respectively. D, H. Scatter plots of correlations between risk scores and the E2F pathway scores in the training set and testing set, respectively.

To assess the possible effects of different pathways, enrichment in every sample was evaluated using single-sample gene set enrichment analysis (ssGSEA). Based on the calculated scores for each pathway, Cox analysis was performed to evaluate their prognostic effects. Results showed that most of the significant pathways from the GSEA output were positively correlated with the survival outcome (Figure 3C, 3G). In addition to the Cox analysis, positive correlations were detected between the pathway ssGSEA scores and the prognostic risk scores. Figure 3D and 3H show the correlation analysis for the most significant pathway (E2F targets) in both the training group and the testing group.

Assessment of prognostic power of gene expression data

Since the gene expression analysis and pathway analysis showed great prognostic value in the study, the prognostic power of the gene expression data was further explored. The C-index was used to assess the predictive power of the gene expression data alone or combined with clinical information. To improve accuracy, cancer types without enough death events (< 20 deaths or < 10% mortality) were excluded. Cancer patients were randomly split into 80% training and 20% testing sets 100 times to calculate the final C-index. As shown in Figure 4A and 4B, the predictive power of gene expression data alone varied across cancer types. In KIRC and GBMLGG, the prognostic power was much higher than in other cancer types.
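For readers unfamiliar with the C-index, the following Python sketch shows one common formulation (Harrell's concordance index) on made-up data. It is an illustration under simplifying assumptions, not the implementation used in the study: pairs with tied survival times are skipped, ties in predicted risk count as half-concordant, and the function name `concordance_index` is hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index; pairs with tied survival times are skipped for simplicity."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed death before time j.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher predicted risk died earlier
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tie in predicted risk
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Made-up example: survival times, event flags (1 = death, 0 = censored), predicted risks.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk, which is why a C-index above 0.5 is read as evidence of predictive power.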

Figure 4

C-indexes by models trained from individual gene expression data alone or in combination with clinical variables

A. C-indexes calculated from the ACC, BLCA, BRCA, CESC, COAD, ESCA, GBMLGG, HNSC, KIRC and KIRP. B. C-indexes calculated from the LAML, LIHC, LUAD, LUSC, MESO, OV, PAAD, SARC, SKCM and UCS. The lightblue box indicates the model built from individual gene expression data alone, and the darkblue box indicates the model built from the combination of gene expression data and clinical variables.

To explore any additional prognostic power, the gene expression data were combined with clinical information. Significant clinical features (correlated with survival) were used as the baseline to build the Cox model. A feature-selection step against the residuals was then used to include the gene features that better fit the model. Results showed that in most cases (18 out of 20), gene expression data alone had significant predictive power (C-index > 0.5). Incorporating clinical information with the gene expression data significantly improved model performance in 12 cancer types (BLCA, BRCA, CESC, GBMLGG, KIRC, KIRP, LAML, LIHC, LUAD, LUSC, OV, SKCM) (p < 0.05) (Figure 4A, 4B).

DISCUSSION

In this study, we assessed the clinical utility of genomic expression data from ~9000 cancer patients across 32 tumor types. The prognostic power across different cancer types was also evaluated [7, 8]. Currently, only a few gene expression-based markers are routinely used in clinical practice [9-12]. The clinical utility of genomic expression has not been fully explored. Yuan et al. reported that for cancer patients, incorporating molecular features with clinical information yields significantly improved predictions. However, they focused on only 4 cancer types (KIRC, GBM, OV, LUSC), and no conclusions could be drawn for the whole cancer population [13]. Recently, Gentles et al. described the genomic prognostic landscape across human cancers, highlighting the promise of genomic expression data as biomarkers for clinical outcomes [14]. In our study, besides illuminating the prognostic landscape of genomic expression, we also performed pathway analysis based on these prognostic genes. In addition, the C-index was calculated from the prognostic models across tumor types to assess the prognostic power of gene expression data. Based on the genomic expression data in the whole cancer population, the top prognostic genes were identified, such as FOXM1, CBX7, CREBL2 and SKP2, which was consistent with previous studies [14-16]. Notably, risk scores built from these top prognostic genes showed significant stratification of survival outcomes in both the training and validation cohorts, indicating that the predictive effect of the prognostic genes is robust. Because of heterogeneity, many statistical methods have been developed to analyze cancer genomics, based on gene sets, pathways and network modules [17-19]. For the first time, our study described the prognostic landscape of biological pathways in the whole cancer population.

Gene set enrichment of the differentially expressed genes revealed significant prognostic pathways, such as E2F targets, MYC targets, G2M checkpoint and interferon gamma response. These pathways are mostly related to cell cycle, proliferation and inflammation, which is consistent with the biological mechanisms of tumor progression [20, 21]. With respect to prognostic power, our results showed that combining clinical and molecular information could improve the predictive power of the gene expression data in most cancer types. Although the absolute gains were limited, the gene-expression signatures provide new biological insights into the process of cancer progression and metastasis that can help to improve the prediction power [22]. Indeed, some gene-based prognostic signatures have already been demonstrated to be clinically useful for predicting the risk of tumor recurrence, such as the 70-gene and 76-gene signatures in breast cancer [23-26]. It is also important to realize that gene expression information is just one of the abundant types of molecular data (genomic, transcriptomic, epigenomic and proteomic) revealing the biological complexity of cancer. Other molecular information will also improve our understanding of the genotype–phenotype relationships involved in cancer. On the other hand, regarding the reliability and reproducibility of the clinical use of molecular data, advances in technology and in statistical and analytical methods are greatly needed to keep pace with clinical needs [22]. In conclusion, our gene analysis and pathway analysis showed significant value for the prediction of survival outcomes in cancer patients. Additionally, combining clinical information with molecular data significantly improved model performance in most cancer types. However, further efforts will be needed to generate prognostic models ready for clinical use.

MATERIALS AND METHODS

Data set compilation

Clinical and survival data were acquired from the TCGA Data Portal (https://tcga-data.nci.nih.gov/tcga/). RNA sequencing data were obtained from the GDAC Firehose System (http://gdac.broadinstitute.org/). To maintain data consistency, only RNA sequencing data from the Illumina HiSeq 2000 RNA Sequencing V2 platform were included. Patients with complete clinical and RNA sequencing data were selected for further analysis. For each cancer data set, patients were split randomly into two groups: 80% as the training set and 20% as the testing set. For the pan-cancer study, all RNA sequencing data were combined by intersecting the common genes across different cancer types.

Prognostic genes and construction of the prognostic model

For RNA sequencing data, all “raw count” values were divided by the 75th percentile of the same patient (after removing zeros) and multiplied by 1000 to obtain the quantile-normalized values for survival analysis. Furthermore, the quantile-normalized data were also transformed to z-scores or normalized by “voom” to evaluate the effects of different normalization methods. The z-score was calculated as “(tumor expression - mean expression in reference) / standard deviation of expression in reference”. The voom normalization was applied using the R package “limma”; it estimates the mean-variance relationship of the log-counts and generates a precision weight for each observation. The association of each gene's expression with survival outcomes was assessed via Cox proportional hazards regression using the ‘coxph’ function of the R ‘survival’ package. Cox coefficients, hazard ratios with 95% confidence intervals, p values, and z-scores were obtained for each gene. Top prognostic genes were identified by their z-scores. Based on these top prognostic genes, risk scores were built, defined as the weighted sum of the independent prognostic gene values (1 for high expression, and 0 for low expression) and weighted by their regression coefficients from the Cox models. Based on the prognostic risk score, further Cox regression analysis and correlation analysis with clinical variables were performed.
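The count normalization described at the start of this subsection can be sketched in Python as follows. This is an illustration, not the authors' pipeline: the text does not say which percentile definition was used, so linear interpolation between order statistics is assumed here, and the function names `percentile` and `upper_quartile_normalize` as well as the gene counts are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def upper_quartile_normalize(counts):
    """Divide one patient's counts by the 75th percentile of non-zero counts, times 1000."""
    nonzero = [c for c in counts.values() if c > 0]
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    scale = percentile(nonzero, 75)
    return {gene: 1000.0 * c / scale for gene, c in counts.items()}


# Made-up raw counts for a single patient; gene names are placeholders.
sample = {"A": 0, "B": 10, "C": 20, "D": 30, "E": 40, "F": 50}
print(upper_quartile_normalize(sample))
```

Zeros are excluded only when computing the scaling factor; genes with zero counts stay in the output with a normalized value of zero.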

Differential expression analysis and clustering analysis

Differential expression analysis was performed using the R “limma” package. Based on the limma output for the most differentially expressed genes, unsupervised hierarchical clustering analysis was used to discover the gene expression patterns of groups sharing common characteristics. Heatmaps were constructed using the R “gplots” package.

Gene set enrichment analysis

Prognostic gene sets are groups of genes that share a common biological function. The evaluation of prognostic gene sets was performed using gene set enrichment analysis (GSEA) [27], where gene sets were obtained from the Molecular Signatures Database (MSigDB) [28]. In addition, a variant of GSEA, termed single sample gene set enrichment analysis (ssGSEA), was applied to calculate separate enrichment scores for each pairing of a sample and gene set [29]. Further Cox regression analysis and correlation analysis were performed based on the enrichment scores of each gene set.

Performance evaluation of gene expression data

Performance evaluation of gene expression data was conducted based on the method suggested by Yuan et al. [13]. First, univariate Cox regression was applied to the training set to select the top features correlated with survival, which were then further selected by LASSO using the R package “glmnet”. The model was then applied to the testing set for prediction. The concordance index (C-index) was estimated from 100 randomizations using the R package “survcomp”. To explore the predictive power of integrating gene expression data with clinical information, we used the significant clinical features (correlated with survival) as the baseline to build the Cox model. Then a feature-selection step against the residuals was applied to add the gene features that better fit the model.
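The repeated-randomization scheme described above can be outlined with a small Python sketch. It only shows the resampling loop: the `evaluate` callback is a hypothetical placeholder for fitting a model on the training samples and returning a C-index on the testing samples, and the function name, seed and rounding rule for the split size are assumptions rather than details from the study.

```python
import random


def repeated_split_scores(sample_ids, evaluate, n_repeats=100, train_fraction=0.8, seed=0):
    """Average a user-supplied score over repeated random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train, test = shuffled[:cut], shuffled[cut:]
        # evaluate() is expected to fit on `train` and score on `test`.
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores), scores


# Toy callback that reports the training fraction, used only to show the mechanics.
def toy_evaluate(train, test):
    return len(train) / (len(train) + len(test))


mean_score, all_scores = repeated_split_scores(range(50), toy_evaluate)
print(mean_score, len(all_scores))
```

Averaging over many random splits reduces the dependence of the reported C-index on any single partition of the patients.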

29 in total

1. Next generation RNA-sequencing in prognostic subsets of chronic lymphocytic leukemia.

Authors: Larry Mansouri; Rebeqa Gunnarsson; Lesley-Ann Sutton; Adam Ameur; Sean D Hooper; Markus Mayrhofer; Gunnar Juliusson; Anders Isaksson; Ulf Gyllensten; Richard Rosenquist
Journal: Am J Hematol Date: 2012-06-03

Review 2. Biomarkers in cancer staging, prognosis and treatment selection.

Authors: Joseph A Ludwig; John N Weinstein
Journal: Nat Rev Cancer Date: 2005-11

3. Analyzing gene expression data in terms of gene sets: methodological issues.

Authors: Jelle J Goeman; Peter Bühlmann
Journal: Bioinformatics Date: 2007-02-15

4. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series.

Authors: Christine Desmedt; Fanny Piette; Sherene Loi; Yixin Wang; Françoise Lallemand; Benjamin Haibe-Kains; Giuseppe Viale; Mauro Delorenzi; Yi Zhang; Mahasti Saghatchian d'Assignies; Jonas Bergh; Rosette Lidereau; Paul Ellis; Adrian L Harris; Jan G M Klijn; John A Foekens; Fatima Cardoso; Martine J Piccart; Marc Buyse; Christos Sotiriou
Journal: Clin Cancer Res Date: 2007-06-01

5. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30

6. The prognostic landscape of genes and infiltrating immune cells across human cancers.

Authors: Andrew J Gentles; Aaron M Newman; Chih Long Liu; Scott V Bratman; Weiguo Feng; Dongkyoon Kim; Viswam S Nair; Yue Xu; Amanda Khuong; Chuong D Hoang; Maximilian Diehn; Robert B West; Sylvia K Plevritis; Ash A Alizadeh
Journal: Nat Med Date: 2015-07-20

Review 7. Cell-cycle-dependent regulation of DNA replication and its relevance to cancer pathology.

Authors: Kiku-E K Tachibana; Michael A Gonzalez; Nicholas Coleman
Journal: J Pathol Date: 2005-01

Review 8. Clinical analysis and interpretation of cancer genome data.

Authors: Eliezer M Van Allen; Nikhil Wagle; Mia A Levy
Journal: J Clin Oncol Date: 2013-04-15

9. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer.

Authors: Marc Buyse; Sherene Loi; Laura van't Veer; Giuseppe Viale; Mauro Delorenzi; Annuska M Glas; Mahasti Saghatchian d'Assignies; Jonas Bergh; Rosette Lidereau; Paul Ellis; Adrian Harris; Jan Bogaerts; Patrick Therasse; Arno Floore; Mohamed Amakrane; Fanny Piette; Emiel Rutgers; Christos Sotiriou; Fatima Cardoso; Martine J Piccart
Journal: J Natl Cancer Inst Date: 2006-09-06

10. Assessing the clinical utility of cancer genomic and proteomic data across tumor types.

Authors: Yuan Yuan; Eliezer M Van Allen; Larsson Omberg; Nikhil Wagle; Ali Amin-Mansour; Artem Sokolov; Lauren A Byers; Yanxun Xu; Kenneth R Hess; Lixia Diao; Leng Han; Xuelin Huang; Michael S Lawrence; John N Weinstein; Josh M Stuart; Gordon B Mills; Levi A Garraway; Adam A Margolin; Gad Getz; Han Liang
Journal: Nat Biotechnol Date: 2014-06-22
