Patrick Terrematte1,2, Dhiego Souto Andrade1, Josivan Justino1,3, Beatriz Stransky1,4, Daniel Sabino A de Araújo1, Adrião D Dória Neto1,5.
Abstract
Patients with clear cell renal cell carcinoma (ccRCC) have poor survival outcomes, especially once the disease has metastasized. It is therefore important to identify biomarkers in genomic data that could help predict the aggressiveness of ccRCC and its resistance to drugs. We conducted a study to evaluate existing gene signatures and to propose a novel one with higher predictive power and better generalization than the former signatures. Using the ccRCC cohorts of the Cancer Genome Atlas (TCGA-KIRC) and the International Cancer Genome Consortium (ICGC-RECA), we evaluated linear survival models of Cox regression with 14 signatures and six methods of feature selection, and performed functional analysis and differential gene expression approaches. We established a 13-gene signature (AR, AL353637.1, DPP6, FOXJ1, GNB3, HHLA2, IL4, LIMCH1, LINC01732, OTX1, SAA1, SEMA3G, ZIC2) whose expression levels are able to predict distinct outcomes of patients with ccRCC, and compared it with other signatures from the literature. The best-performing gene signature was obtained using the ensemble method Min-Redundancy and Max-Relevance (mRMR). Compared with the others, this signature generalizes across different cohorts and is functionally enriched in significant pathways: Urothelial Carcinoma, Chronic Kidney Disease, Transitional Cell Carcinoma, and Nephrolithiasis. Of the 13 genes in our signature, eight are known to be correlated with ccRCC patient survival and four are immune-related. Our model achieved an Area Under the Curve (AUC) of 0.82 on the Receiver Operating Characteristic (ROC) metric and generalized well between the cohorts. Our findings revealed two clusters of genes, one with high expression (SAA1, OTX1, ZIC2, LINC01732, GNB3 and IL4) and one with low expression (AL353637.1, AR, HHLA2, LIMCH1, SEMA3G, DPP6, and FOXJ1), both of which are correlated with poor prognosis.
This signature can potentially be used in clinical practice to support patient treatment care and follow-up.
Keywords: clear cell renal cell carcinoma (ccRCC); feature selection; gene signature; kidney cancer; machine learning; mutual information; prognosis; survival analysis
Year: 2022 PMID: 35565241 PMCID: PMC9103317 DOI: 10.3390/cancers14092111
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.575
Figure 1. Flowchart of the current study to obtain a gene signature based on mutual information, Minimum Redundancy Maximum Relevance (mRMR). Datasets are indicated by cylinders, white rectangles represent analysis steps, and blue rectangles indicate the resulting figures and tables. TCGA-KIRC and ICGC-RECA are datasets of ccRCC.
Figure A1. Number of papers published on PubMed by year, from a query performed in January 2021. The gene signatures published from 2015 to 2020 (in green) were initially selected for comparison. After applying the exclusion criteria, we obtained the 14 gene signatures.
Gene signatures of ccRCC after exclusion criteria. The PubMed query was conducted in January 2021 using the terms: (renal OR kidney) AND (clear cell) AND (cancer) AND (prognosis OR survival OR outcomes) AND (gene signature AND regression), and filtering the years of 2015 to 2020.
| Title and Code in | Gene Signature |
|---|---|
| Prognostic gene signature identification using causal structure learning: applications in kidney cancer [ | ETV5, CREB3L1, GMPS, RBM15, SEPT6, TTL, ARID1A, ERCC5, TFG, FLT3, SLC34A2, FAM46C, PER1, DDB2, NACA, MLLT10, HMGA1, TCF12, RUNX1, CANT1, REL, ZNF331, JAZF1, ASPSCR1, PLAG1, NOTCH1, TAL2, ERCC2, SMARCA4, DNMT3A, HOXA11, GNAS, CHEK2, HLF, GNAQ, ETV6, SET, KIF5B, TRRAP, CDKN2C, VHL, RPL22, CHN1, STAT3, CDK4, CD274, KTN1, CYLD, BRD3, TRIM33 |
| A Five-Gene Signature Predicts Prognosis in Patients with Kidney Renal Clear Cell Carcinoma [ | CKAP4, ISPD, MAN2A2, OTOF, SLC40A1 |
| A four-gene signature predicts survival in clear-cell renal-cell carcinoma [ | PTEN, PIK3C2A, ITPA, BCL3 |
| Identification and validation of an eight-gene expression signature for predicting high Fuhrman grade renal cell carcinoma [ | ATOH8, ATP1A3, C10orf4, C17orf79, CHMP4C, CNGA1, EDA, FBXL3, GMDS, ISL2, KISS1, KLF2, MYADML2, NCRNA00116, OAZ1, ODZ3, PLA2G15, PPP1R1A, RAB40A, RRAS, SPOCK1, SQSTM1, TXNDC16, VAMP3 |
| Comprehensive assessment gene signatures for clear cell renal cell carcinoma prognosis [ | INTS8, GTPBP2, ANK3, SLC16A12, LIMCH1, Hsa-mir-374a |
| A five-gene signature may predict sunitinib sensitivity and serve as prognostic biomarkers for renal cell carcinoma [ | BIRC5, CD44, MUC1, TF, CCL5 |
| A Gene Signature of Survival Prediction for Kidney Renal Cell Carcinoma by Multi-Omic Data Analysis [ | BID, CCNF, DLX4, FAM72D, PYCR1, RUNX1, TRIP13 |
| Prognostic value of a gene signature in clear cell renal cell carcinoma [ | CENPW, FOXM1, NUF2 |
| Identification of a 5-Gene Signature Predicting Progression and Prognosis of Clear Cell Renal Cell Carcinoma [ | OTX1, FOXE1, FAM83A, HMGA2, KRT6A, DPYSL5, ANXA8, MATN4, ROS1, CSMD3, MAGEC3, AMER2, CPLX2, PI3, KRT13, ERVV-2, ERVFRDE1, ANKFN1, VTN, NFE4, ZNF114 |
| Construction and Validation of a 9-Gene Signature for Predicting Prognosis in Stage III Clear Cell Renal Cell Carcinoma [ | ATP6V1C2, PCSK1N, PREX1, ANK3, HLA-DRA, SELENBP1, TYRP1, GABRA2, SERPINA5 |
| Construction and validation of a seven-gene signature for predicting overall survival in patients with kidney renal clear cell carcinoma via an integrated bioinformatics analysis [ | PODXL, SLC16A12, ZIC2, ATP2B3, KRT75, C20orf141, CHGA |
| A 14 immune-related gene signature predicts clinical outcomes of kidney renal clear cell carcinoma [ | TXLNA, SEMA3G, AR, BID, IL20RB, CCR10, BMP8A, SEMA3A, CCL7, GDF1, KLRC2, LHB, FGF17, IL4 |
| A seven-gene signature model predicts overall survival in kidney renal clear cell carcinoma [ | APOLD1, C9orf66, G6PC, PPP1R1A, CNN1G, TIMP1, TUBB2B |
| Identification of gene signature for treatment response to guide precision oncology in clear-cell renal cell carcinoma [ | ANGPT4, EDN1, VEGFA, ESM1, FLT1, KDR, CD34, PECAM1, NOTCH1, EDNRB, STIM2, FYN, VWF, GJA1, MCF2L, PPM1F, PTPRB, HEY1, ETS1, EXOC3L2, TBXA2R, TCF4, S1PR1, SLC9A3R2, NES, NFATC1, NOS3, PDE2A, CORO1A, CCR5, CXCR3, PTK2B, WAS, CD72, IL16, FYB1, FASLG, FERMT3, FOXP3, XCL2, CD3E, CD7, LAX1, CD38, LCP1, LCP2, ITK, LAT, LCK, GRK2, CCL4, CCL5, CD2, PRF1, TIGIT, GZMA, GZMB, CD8A, CTLA4, EOMES, PDCD1, PYHIN1, SLA2, LTA, PSMB8, PSMB9 |
Figure A2. Scatter plot of median gene expression comparing TCGA-KIRC and ICGC-RECA. (a) Raw counts. (b) log2(count + 1) normalization. (c) Variance-stabilizing transformation with DESeq2. (d) Box-Cox transformation. (e) Scaling between zero and one (with Caret R package and ‘range’ method). (f) Scaling between zero and one (with BBmisc R package and ‘range’ method).
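Two of the transformations listed in this caption, log2(count + 1) and scaling between zero and one, are simple enough to sketch directly. The following is a minimal Python illustration under stated assumptions: the function names and the toy counts are hypothetical, and it does not reproduce the R packages (Caret, BBmisc) or the DESeq2 and Box-Cox steps named in the caption.

```python
import math

def log2_plus_one(counts):
    """Apply the log2(count + 1) transformation to a list of raw counts."""
    return [math.log2(c + 1) for c in counts]

def range_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene carries no spread; map it to zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw counts for one gene across four samples.
raw = [0, 1, 3, 7]
logged = log2_plus_one(raw)   # [0.0, 1.0, 2.0, 3.0]
scaled = range_scale(logged)  # first value 0.0, last value 1.0
print(logged, scaled)
```

Scaling each cohort to the same [0, 1] range is one plausible way to make expression values from two datasets comparable, which is what the panels in this figure compare.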
New gene signatures of ccRCC obtained by state-of-the-art machine learning methods for feature selection: Recursive Feature Elimination, Boruta, Rpart, GBM, and XGBoost for Survival.
| Code | Method | N. Genes | Gene Signature |
|---|---|---|---|
| GBM | Filtering with Generalized Boosted Regression Models for Cox Proportional Hazard | 30 | AC084117.1, CRHBP, LINC00973, ITPKA, IGFN1, C14orf37, OTX1, LINC02446, HOTTIP, NEIL3, ZIC5, CCDC154, IL4, AC008663.1, FER1L4, DUSP5P1, AL078604.2, KRT6A, SPATC1L, RTL1, LINC01597, CRABP1, RASGRP3, C3orf85, AL034399.1, TRIM4, LINC00475, ADAMTS14, DPP6 |
| Rpart | Filtering with Recursive partitioning for survival trees | 30 | TROAP, KIF18B, AURKB, LINC00973, AC003092.1, G6PC, ZNF181, MYBL2, FOXM1, NUF2, POU4F1, APOM, AR, NPHS1, AC018638.2, MERTK, AC098679.1, AL353637.1, IYD, C17orf80, SLC12A3, CDCA2, LINC02362, SRD5A3, EIF3F, AC138393.1, MCC, WFIKKN1, ALDOB, APOL5 |
| XGBoost | Filtering with | 30 | LINC00973, LINC01271, CHAT, SPIC, AL355796.1, DLK1, ZIC5, LINC01700, ENTPD6, ATOH8, C14orf37, WNT7B, THEG, AC084117.1, ADA2, DCSTAMP, AL450311.2, A3GALT2, CNTNAP3B, TBC1D27, BIRC7, LINC00943, LINC01529, OR4C6, FAM47E, BCL3, AC105118.1, AL359736.1, SLC44A3, LINP1 |
| Boruta | Wrapper Boruta with XGBoost for Survival Data | 43 | Age, ZIC2, CHAT, AMH, OTX1, BARX1, TROAP, CKAP4, ITPKA, NUF2, KRT75, KIF18B, SLC18A3, AL355796.1, RPL10P19, LINC02154, LINC00973, IL4, HOTAIRM1, Z84485.1, LINC02362, CASP9, CCNF, RTL1, BID, CHGA, RANBP3L, ZIC5, SLC16A12, SPATC1L, CD44, KRI1, RUFY4, AC073324.1, AC091812.1, AC156455.1, AGAP6, AC128685.1, SEMA3G, IGFN1, KLRC2, ANXA8, AURKB |
| RFE | Wrapper with Recursive Feature Elimination | 89 | A3GALT2, AC006450.2, AC073324.1, AC093520.1, AC103925.1, AC120498.6, AC128685.1, AC156455.1, ADAMTS14, AL355796.1, AL592494.1, AL606519.1, AMH, ANK3, ANXA8, AP000697.1, AP001029.1, AURKB, BARX1, BIRC5, C20orf141, CCNF, CDC42P2, CENPW, CHAT, CHGA, CKAP4, CRHBP, DLX4, DMRT3, DUSP5P1, G6PC, GOLGA6L2, GOLGA6L7P, HAMP, HAO1, HOTAIRM1, HP, IGFN1, IGHJ3P, IL20RB, IL4, ISL2, ITPKA, KIF18B, KLRC2, KRT75, KRT78, LINC00051, LINC00460, LINC00524, LINC00896, LINC00973, LINC01234, LINC01501, LINC01655, LINC01700, LINC01956, LINC02154, LINC02362, NEIL3, NFE4, NUF2, OTX1, PAEP, PGLYRP2, PI3, PITX1, PLG, PTPRB, RALYL, RPL10P19, RTL1, SAA1, SAA2, SAA4, SIM2, SLC16A12, SLC18A3, TGM3, TRIP13, TROAP, VSX1, WFDC10B, Z84485.1, ZIC2, ZIC5, ZPLD1 |
| mRMR | Ensemble of Min-redundancy and Max-relevance with survival data | 65 | AR, AL353637.1, DPP6, FOXJ1, GNB3, HHLA2, IL4, LIMCH1, LINC01732, OTX1, SAA1, SEMA3G, ZIC2 |
Figure A3. Variable ranking based on mutual information of the 10 most important genes of the mRMR 13-gene signature of ccRCC. These are the most representative genes with respect to the AJCC staging of the TCGA dataset.
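The idea behind ranking by mutual information and selecting with mRMR can be illustrated with a small greedy sketch. This is a generic, single-run version of the mRMR difference criterion (relevance minus mean redundancy) on discrete toy data, not the ensemble mRMR with survival data used in the study; the feature names `a`, `b`, `c` and all values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mrmr_select(features, target, k):
    """Greedily pick k feature names, maximizing relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, candidates = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)  # ties resolve to the earliest candidate
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: "b" is an exact copy of "a", "c" is weaker but less redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 0, 1, 0, 1, 1],
}
print(mrmr_select(features, target, 2))  # ['a', 'c']
```

The duplicate feature `b` is as relevant as `a`, but once `a` is selected its redundancy penalty outweighs its relevance, so the less informative but more independent `c` is preferred.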
Figure A4. Collinearity analysis with variance inflation factors of the 13-gene signature of ccRCC. None of the genes had a variance inflation factor ≥ 5, indicating no collinearity or redundancy in the signature.
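For readers unfamiliar with variance inflation factors (VIFs), the sketch below computes them as the diagonal of the inverse of the correlation matrix, a standard identity equivalent to 1 / (1 − R²) for each variable regressed on the others. It is a dependency-free illustration on hypothetical data; the helper names are assumptions, and it is not the code used to produce this figure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(columns):
    """Return {name: VIF} using the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

# Hypothetical expression values: g1 and g2 are uncorrelated, g1 and g3 nearly collinear.
independent = {"g1": [1, 2, 3, 4], "g2": [1, -1, -1, 1]}
collinear = {"g1": [1, 2, 3, 4], "g3": [1, 2, 3, 5]}
print(variance_inflation_factors(independent))
print(variance_inflation_factors(collinear))
```

With the threshold quoted in the caption, the first pair would pass (VIF near 1) while the second would be flagged (VIF well above 5).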
Figure A5. Correlation analysis between genes of the mRMR 13-gene signature of ccRCC. No strong correlation (≥ 0.70) was found between genes, or with the clinical data of age, overall survival status, and AJCC staging.
Figure A6. Density plot of the distribution of overall patient survival in TCGA-KIRC and ICGC-RECA. The dotted lines indicate the means of the distributions, and the solid lines indicate the prediction times used for internal and external validation. We restrict the prediction for TCGA-KIRC to 10 years to exclude outliers in the long tail of the density plot of the patients’ overall survival. For the ICGC-RECA dataset, we decided to maintain a 7-year prediction in order to include all samples and to limit the prediction time to the range of the distribution of this dataset for external validation.
Study Characteristics of TCGA-KIRC and ICGC-RECA cohort with the clinical characteristics: age, gender, tumor grade, metastasis, and staging by the American Joint Committee on Cancer (AJCC).
| Clinical Characteristics | | Training Cohort | Validation | p-value |
|---|---|---|---|---|
| Overall survival (days) | Mean (SD) | 1343.2 (976.6) | 1511.6 (634.6) | 0.113 |
| Overall survival status, n (%) | Alive | 359/530 (67.7) | 61/91 (67.0) | 0.991 |
| | Deceased | 171/530 (32.3) | 30/91 (33.0) | |
| Age, years | Mean (SD) | 60.5 (12.0) | 60.5 (10.0) | 0.99 |
| Gender, n (%) | Female | 183/530 (34.5) | 39/91 (42.9) | 0.158 |
| | Male | 347/530 (65.5) | 52/91 (57.1) | |
| AJCC stage, n (%) | T1 | 270/530 (50.9) | 54/91 (59.3) | 0.343 |
| | T2 | 70/530 (13.2) | 13/91 (14.3) | |
| | T3 | 179/530 (33.8) | 22/91 (24.2) | |
| | T4 | 11/530 (2.1) | 2/91 (2.2) | |
| Neoplasm, n (%) | N0 | 239 (45.1) | 79 (86.8) | <0.001 |
| | N1 | 16 (3.0) | 2 (2.2) | |
| | NX | 275 (51.9) | 10 (11.0) | |
| Metastasis, n (%) | M0 | 422/528 (79.9) | 81/91 (89.0) | 0.081 |
| | M1 | 78/528 (14.8) | 9/91 (9.9) | |
| | MX | 28/528 (5.3) | 1/91 (1.1) | |
1 The metastasis values do not sum up to heading totals because of missing data.

2 The statistical tests for age and overall survival days are performed by Wilcoxon rank-sum test, and all other comparisons are by Fisher’s exact test.
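To make the categorical comparisons concrete, here is a minimal sketch of a two-sided Fisher's exact test for a 2×2 table, written from the hypergeometric distribution with the standard library only. The function name is hypothetical, the example counts are the overall survival status rows of the table above, and the p-value it yields need not match the reported one exactly, since the software and options used in the study are not stated here.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell
        # when all row and column totals are held fixed.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Rows: training cohort, validation cohort. Columns: alive, deceased.
status_table = [[359, 171], [61, 30]]
print(fisher_exact_two_sided(status_table))
```

Tables with more than two categories, such as AJCC stage, need the generalized (r × c) form of the test, which this sketch does not cover.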
Figure 2. Selected genes through mRMR. (a) Venn diagram of prefiltered gene sets. A total of 3284 prefiltered genes is given by the sets from differential expression analysis (DEA) between non-metastatic versus metastatic samples (156), DEA between normal tissues versus primary tumor (1775), genes from the literature (221), significant eQTL genes (1259), and 124 genes overlapping in two or three intersections of sets. (b) Volcano plot of DEA comparing normal tissues versus primary tumor samples of TCGA-KIRC. In green, we see the downregulated genes of normal tissues versus primary tumors (DPP6 and FOXJ1). In red, we see the upregulated genes (HHLA2, LINC01732, SAA1, AL353637.1, and ZIC2). In gray, we see the non-significant genes with low fold change. (c) Volcano plot of DEA comparing non-metastatic versus metastatic samples. In red, we see the upregulated genes (OTX1 and ZIC2).
Figure A7. Circular diagram of the mRMR gene signature and the sources of its genes: DEA, genes from the GTEx portal of expression quantitative trait loci (eQTLs) in Kidney Cortex, and gene signatures from the literature.
Figure 3. Benchmark with internal and external validation. (a) Comparison of 14 gene signatures from the literature and 6 feature selection methods on 8 models for survival risk, showing the predicted AUC of survival outcome for the 10-year prediction. (b) Boxplots of the results of each gene signature and feature selection method for the 7-year prediction.
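As a reminder of what the AUC values in this benchmark measure, the sketch below computes a plain ROC AUC as the fraction of (event, non-event) pairs in which the event case receives the higher risk score, with ties counted as one half. This is a simplification: it treats the outcome at the prediction horizon as a binary label and ignores censoring, so it is not the time-dependent survival AUC used in the study. All scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive (1) and negative (0) cases."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores; label 1 means the event occurred before the horizon.
scores = [0.9, 0.8, 0.1, 0.85]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))  # 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the scale on which the signatures in this figure are compared.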
Figure 4. Survival risk predictions with mRMR signature and dimensionality reduction. (a) The survival curves are predicted in three equal-size strata of risk groups of the TCGA-KIRC dataset: higher risk (red), lower risk (green), and moderate risk (orange). (b) A dimension reduction of genes from the mRMR signature, using principal components analysis. (c) The survival curves were predicted by validating on the ICGC-RECA dataset. (d) The principal components analysis of the ICGC-RECA dataset with genes of the mRMR signature.
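The split into three equal-size risk strata can be sketched as a rank-based tertile assignment. The example assumes each patient already has a predicted risk score (in the study these come from the survival models); the patient identifiers, scores, and function name below are hypothetical.

```python
def risk_strata(risk_scores):
    """Assign each patient to one of three (near) equal-size groups by risk rank."""
    labels = ("lower", "moderate", "higher")
    ordered = sorted(risk_scores, key=risk_scores.get)  # ascending risk
    n = len(ordered)
    # Integer arithmetic maps rank i in [0, n) onto tertile 0, 1, or 2.
    return {patient: labels[(i * 3) // n] for i, patient in enumerate(ordered)}

# Hypothetical predicted risk scores for six patients.
scores = {"p1": 0.10, "p2": 0.95, "p3": 0.40, "p4": 0.55, "p5": 0.80, "p6": 0.20}
groups = risk_strata(scores)
print(groups)
```

Splitting by rank rather than by fixed score cut-offs guarantees groups of comparable size, which keeps the per-group survival curves estimated on similar numbers of patients.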
Figure A8. Forest plot for the Cox proportional hazards model displaying the significant genes (AL353637.1, DPP6, FOXJ1, HHLA2, and SAA1). The statistical significance between comparisons is given by * p-value < 0.05, ** p-value < 0.01, and *** p-value < 0.001.
Figure 5. Aalen’s additive Cox regression model for censored data of the mRMR signature, and the clinical features age and metastasis. (a) Dot-and-whisker plots of the estimated coefficients (β), with z-scores, 95% confidence intervals, and p-values. (b) Curves of each term for the censored data in relation to time (days).
Figure A9. Analysis performed using the UALCAN portal with data of ccRCC from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [50], available at http://ualcan.path.uab.edu/ (accessed on 1 March 2022). Z-values represent standard deviations from the median across samples for the given cancer type of ccRCC. The statistical significance between comparisons is given by * p-value < 0.05, ** p-value < 0.01, and *** p-value < 0.001. (a) Comparison of protein expression by cancer stages of AR. (b) Comparison of protein expression by cancer stages of GNB3. (c) Comparison of protein expression by cancer stages of HHLA2. (d) Comparison of protein expression by cancer stages of LIMCH1. (e) Comparison of protein expression by cancer stages of SAA1.
Figure A10. Heatmap with hierarchical clustering combining RNA-seq expression of patients in TCGA-KIRC and ICGC-RECA. Columns are genes of the mRMR signature. Rows indicate RNA-seq expression of 590 patients of TCGA-KIRC and ICGC-RECA. Data of patients with distant metastasis that cannot be assessed (MX) were removed in order to clarify the clustering.
Figure 6. Gene enrichment analysis. (a) Heatmap of enriched terms and relationships of genes, displaying the fold change of differential analysis of normal tissues versus primary tumors of TCGA-KIRC samples. (b) Enrichment analysis of gene-disease associations (GDAs) from DisGeNET (v7.0) of expert curated databases.