Khaled Bin Satter, Paul Minh Huy Tran, Lynn Kim Hoang Tran, Zach Ramsey, Katheine Pinkerton, Shan Bai, Natasha M Savage, Sravan Kavuri, Martha K Terris, Jin-Xiong She, Sharad Purohit.
Abstract
Publicly available gene expression datasets were analyzed to develop a chromophobe- and oncocytoma-related gene signature (COGS) to distinguish chromophobe renal cell carcinoma (chRCC) from renal oncocytoma (RO). The datasets GSE11151, GSE19982, GSE2109, GSE8271 and GSE11024 were combined into a discovery dataset. The transcriptomic differences were identified with unsupervised learning in the discovery dataset (97.8% accuracy) with density-based UMAP (DBU). The top 30 genes were identified by univariate gene expression analysis and ROC analysis to create a gene signature called COGS. COGS, combined with DBU, was able to differentiate chRCC from RO in the discovery dataset with an accuracy of 97.8%. The classification accuracy of COGS was validated in an independent meta-dataset consisting of TCGA-KICH and GSE12090, where COGS could differentiate chRCC from RO with 100% accuracy. The differentially expressed genes were involved in carbohydrate metabolism, transcriptomic regulation by TP53, beta-catenin-dependent Wnt signaling, and cytokine (IL-4 and IL-13) signaling highly active in cancer cells. Using multiple datasets and machine learning, we constructed and validated COGS as a tool that can differentiate chRCC from RO and complement histology in routine clinical practice to distinguish these two tumors.
Keywords: chromophobe; classification; gene signature; machine learning; oncocytoma; transcriptomic
Year: 2022 PMID: 35053403 PMCID: PMC8774230 DOI: 10.3390/cells11020287
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1. Flow chart depicting selection and preparation of chRCC and RO arrays from GEO for meta-analysis.
GEO datasets selected for chromophobe renal cell carcinoma (chRCC) and renal oncocytoma (RO) classification as discovery and validation datasets. Table contains details on number of probes, total number of arrays in the study and number of arrays selected under chRCC, RO and normal kidney tissue.
| GEO Accession ID | Number of Probes | Total Number of Arrays/Study | Arrays Selected: RO | Arrays Selected: ChRCC | Arrays Selected: Normal Kidney |
|---|---|---|---|---|---|
| GSE11024 | 17,700 | 79 | 7 | 6 | 12 |
| GSE11151 | 54,676 | 67 | 4 | 4 | 5 |
| GSE19982 | 54,676 | 30 | 15 | 15 | 0 |
| GSE8271 | 54,676 | 34 | 10 | 10 | 0 |
| GSE2109 | 17,232 | 2158 | 0 | 18 | 0 |
| TCGA-KICH | 60,483 | 89 | 0 | 65 | 24 |
| GSE12090 | 54,676 | 18 | 9 | 9 | 0 |
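The per-study counts above can be tallied to check them against the sample numbers quoted elsewhere in this record. The sketch below copies the counts from the table and uses the discovery/validation split given in the abstract; the variable and function names are illustrative only. It reproduces the 89 discovery tumors (of which 87 follow their histology in Figure 3F, i.e. about 97.8%) and the 74 chRCC and 9 RO validation samples reported in Figure 6.

```python
# Arrays selected per study, copied from the table above: (RO, chRCC, normal kidney).
selected = {
    "GSE11024": (7, 6, 12),
    "GSE11151": (4, 4, 5),
    "GSE19982": (15, 15, 0),
    "GSE8271": (10, 10, 0),
    "GSE2109": (0, 18, 0),
    "TCGA-KICH": (0, 65, 24),
    "GSE12090": (9, 9, 0),
}

# Split stated in the abstract: five discovery studies, two validation studies.
discovery = ["GSE11151", "GSE19982", "GSE2109", "GSE8271", "GSE11024"]
validation = ["TCGA-KICH", "GSE12090"]


def tally(studies):
    """Sum the RO and chRCC arrays over the given studies."""
    ro = sum(selected[s][0] for s in studies)
    chrcc = sum(selected[s][1] for s in studies)
    return ro, chrcc


disc_ro, disc_chrcc = tally(discovery)
val_ro, val_chrcc = tally(validation)
disc_total = disc_ro + disc_chrcc

# Share of discovery tumors whose DBU group matches histology (87 of them, per Figure 3F).
concordance_pct = round(100 * 87 / disc_total, 1)

print(disc_ro, disc_chrcc, disc_total)  # RO, chRCC, total tumors in discovery
print(val_ro, val_chrcc)                # RO, chRCC in validation
print(concordance_pct)
```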
Figure 2. Quality control of the discovery dataset showing batch effects before and after correction. Before batch effect correction, principal component analysis shows that differences between batches (A) are larger than differences between histologies (B) for chromophobe (chRCC), renal oncocytoma (RO) and normal kidney tissue (N) arrays. After batch correction by empirical Bayes (ComBat), histological differences (D) are larger than batch differences (C).
Figure 3. Implementation of unsupervised machine learning algorithm (UMLA) for differentiating chRCC and RO: (A) two-dimensional embedding of the whole genome (n = 15,875 genes) with UMAP, showing two clusters with high concordance with their histological classification; (B) representative map showing the optimized final parameters for UMAP, which performed best for maximum inter-cluster and minimum intra-cluster distance, red = chRCC, blue = RO; (C) representative map showing poorly fit parameters for UMAP analysis, red = chRCC, blue = RO; (D) representative iterations of DBU (iterations 70 and 136). All 1000 iterations were tracked to determine the final groups, which required support from > 70% of iterations, red triangles = cluster 1 in machine learning model, green triangles = cluster 2 in machine learning model; (E) group consensus heatmap. Samples are presented in columns and iterations in rows. Two colors (dark and light blue) represent the two DBU groups based on 1000 iterations of DBU with 1000 random genes in each iteration; (F) Sankey diagram tracking all samples from the study to DBU classification, color represents histology type (ChRCC = green, RO = pink). A total of 87/89 samples follow their histological classification with DBU.
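The consensus step described in panels (D) and (E) can be illustrated with a small sketch: each sample is assigned to a group only if more than 70% of iterations place it there. This is not the authors' code. The function name `consensus_groups`, the toy label matrix, and the assumption that cluster labels are already aligned across iterations are all illustrative; real clustering runs may swap labels between iterations and would need relabeling first.

```python
from collections import Counter


def consensus_groups(iteration_labels, min_support=0.70):
    """Return one consensus label per sample, or None when no label is
    supported by more than `min_support` of the iterations.

    iteration_labels: list of iterations, each a list with one cluster
    label per sample. Labels are assumed to be aligned across iterations.
    """
    n_iter = len(iteration_labels)
    n_samples = len(iteration_labels[0])
    calls = []
    for j in range(n_samples):
        counts = Counter(run[j] for run in iteration_labels)
        label, votes = counts.most_common(1)[0]
        calls.append(label if votes / n_iter > min_support else None)
    return calls


# Toy input with invented values: 10 iterations (rows) x 4 samples (columns).
runs = [
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
    [1, 2, 1, 1],
    [1, 2, 1, 1],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
]

calls = consensus_groups(runs)
print(calls)
```

In this toy input the third sample is placed in cluster 1 by 8 of 10 iterations and keeps that call, while the fourth sample reaches only 6 of 10 for its most frequent label and is left unassigned.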
Figure 4. Gene selection and unsupervised model consistency: (A,B) boxplots showing log2-transformed expression values (y-axis) for two representative genes (HOOK2 and PNPT1) from COGS in chRCC (green) and RO (magenta), outliers are represented with points (black); (C) group consensus heatmap by DBU with 20 random genes from GS30, showing a consistent classification with unsupervised models; (D) heatmap of COGS in meta-analysis showing expression differences between the subtypes.
List of thirty genes combined to create the COGS signature for distinguishing chRCC from RO. Data are from the discovery meta-dataset and show, for each gene, the optimum cutpoint, accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUROC) and log fold change (FC).
| Gene | Optimum Cutpoint | Accuracy | Sensitivity | Specificity | AUROC | FC * | Adj. p-value |
|---|---|---|---|---|---|---|---|
|  | 8.87 | 0.98 | 0.96 | 1.00 | 1.00 | 4.48 | 4.63 × 10^-26 |
|  | 8.29 | 0.98 | 0.94 | 1.00 | 1.00 | 36.98 | 2.61 × 10^-32 |
|  | 8.02 | 0.94 | 0.91 | 0.97 | 0.99 | 2.76 | 1.07 × 10^-21 |
|  | 7.82 | 1.00 | 1.00 | 1.00 | 1.00 | 3.52 | 1.73 × 10^-26 |
|  | 10.50 | 0.93 | 0.91 | 0.94 | 0.97 | 42.73 | 1.12 × 10^-19 |
|  | 5.23 | 0.94 | 0.94 | 0.94 | 0.98 | 3.32 | 9.59 × 10^-20 |
|  | 7.86 | 1.00 | 1.00 | 1.00 | 1.00 | 3.07 | 1.85 × 10^-28 |
|  | 7.99 | 0.99 | 0.98 | 1.00 | 1.00 | 10.32 | 5.01 × 10^-31 |
|  | 9.42 | 1.00 | 1.00 | 1.00 | 1.00 | 3.84 | 5.12 × 10^-36 |
|  | 6.79 | 0.99 | 1.00 | 0.98 | 1.00 | 3.68 | 3.84 × 10^-23 |
|  | 5.49 | 1.00 | 1.00 | 1.00 | 1.00 | 2.80 | 9.52 × 10^-24 |
|  | 9.19 | 1.00 | 1.00 | 1.00 | 1.00 | 2.93 | 3.08 × 10^-28 |
|  | 7.72 | 0.96 | 0.94 | 1.00 | 0.98 | 55.34 | 4.83 × 10^-27 |
|  | 5.87 | 0.94 | 0.96 | 0.92 | 0.97 | 6.88 | 1.60 × 10^-21 |
|  | 10.35 | 0.94 | 0.94 | 0.94 | 0.97 | 3.33 | 3.70 × 10^-16 |
|  | 6.31 | 0.96 | 0.94 | 1.00 | 0.99 | 8.02 | 3.03 × 10^-23 |
|  | 8.72 | 1.00 | 1.00 | 1.00 | 1.00 | 3.46 | 1.51 × 10^-34 |
|  | 5.70 | 0.99 | 1.00 | 0.98 | 1.00 | 3.04 | 1.26 × 10^-28 |
|  | 8.91 | 1.00 | 1.00 | 1.00 | 1.00 | 5.22 | 2.91 × 10^-31 |
|  | 6.54 | 0.99 | 1.00 | 0.98 | 1.00 | 3.63 | 2.68 × 10^-29 |
|  | 9.18 | 0.99 | 0.97 | 1.00 | 0.99 | 3.16 | 2.79 × 10^-27 |
|  | 8.52 | 0.96 | 0.94 | 0.98 | 0.95 | 6.05 | 6.35 × 10^-19 |
|  | 7.79 | 0.98 | 0.96 | 1.00 | 1.00 | 5.78 | 2.48 × 10^-27 |
|  | 8.38 | 1.00 | 1.00 | 1.00 | 1.00 | 2.80 | 5.05 × 10^-34 |
|  | 11.59 | 0.98 | 0.97 | 0.98 | 1.00 | 4.02 | 4.40 × 10^-26 |
|  | 6.89 | 0.98 | 0.96 | 1.00 | 0.99 | 6.29 | 4.17 × 10^-23 |
|  | 8.43 | 0.95 | 1.00 | 0.91 | 0.98 | 3.70 | 3.90 × 10^-19 |
|  | 7.32 | 0.92 | 0.91 | 0.92 | 0.97 | 3.59 | 1.05 × 10^-18 |
|  | 10.59 | 1.00 | 1.00 | 1.00 | 1.00 | 3.53 | 4.32 × 10^-30 |
|  | 9.71 | 0.99 | 1.00 | 0.98 | 1.00 | 3.68 | 1.09 × 10^-26 |
* FC: fold change; AUROC: area under the receiver operating characteristic curve.
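To make the table columns concrete, the sketch below shows one common way a per-gene cutpoint and its sensitivity, specificity and accuracy can be derived from expression values. The source does not state which criterion defined the "optimum" cutpoint; maximizing Youden's J (sensitivity + specificity − 1) is an assumption here, as are the function names, the invented expression values, and the convention that higher expression indicates the positive class.

```python
def cutpoint_metrics(values, labels, cutpoint):
    """Sensitivity, specificity and accuracy when a sample is called
    positive if its expression value is >= cutpoint (labels: 1 = positive)."""
    tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cutpoint)
    fn = sum(1 for v, y in zip(values, labels) if y == 1 and v < cutpoint)
    tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cutpoint)
    fp = sum(1 for v, y in zip(values, labels) if y == 0 and v >= cutpoint)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(values)
    return sensitivity, specificity, accuracy


def optimum_cutpoint(values, labels):
    """Scan every observed value as a candidate cutpoint and keep the one
    with the highest Youden's J (assumed criterion, not from the source)."""
    best = None
    for c in sorted(set(values)):
        sens, spec, acc = cutpoint_metrics(values, labels, c)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec, acc)
    return best[1:]


# Invented log2 expression values for one hypothetical gene.
values = [9.1, 9.4, 8.9, 9.8, 8.6, 9.0, 7.5, 8.0, 7.9, 8.7]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive class, 0 = negative class

cut, sens, spec, acc = optimum_cutpoint(values, labels)
print(cut, round(sens, 2), round(spec, 2), round(acc, 2))
```

With these toy values the scan settles on the cutpoint that misses one positive sample but excludes every negative sample, which is the kind of trade-off the sensitivity and specificity columns summarize.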
Figure 5. Molecular and pathway analysis: (A) top 194 differentially expressed genes (candidate genes) between chRCC and RO. Genes are presented in rows and arrays in columns; (B) bubble plot showing top upregulated pathways in chRCC and RO (x-axes) when compared to normal kidney tissue, from gene set enrichment analysis of canonical pathways for differentially expressed genes. Y-axes represent different canonical pathways and the size of the bubble represents the normalized enrichment score.
Figure 6. COGS validation on RNA-Seq (TCGA-KICH) and microarray dataset (GSE12090): (A) two-dimensional embedding plot with UMAP using COGS showing no batch effect between the studies in the validation dataset; (B) sample plot with histology annotation showing distinct clusters for chRCC (n = 74) and RO (n = 9) with COGS; (C) unsupervised hierarchical clustering with COGS for the validation dataset; (D) UMAP of TCGA renal cohort showing a distinct cluster for chRCC samples (lime green) for COGS.