Sherry Bhalla, Kumardeep Chaudhary, Ritesh Kumar, Manika Sehgal, Harpreet Kaur, Suresh Sharma, Gajendra P S Raghava.
Abstract
In this study, an attempt has been made to identify expression-based gene biomarkers that can discriminate between the early and late stages of clear cell renal cell carcinoma (ccRCC). We analyzed the gene expression of 523 samples to identify genes that are differentially expressed between the early and late stages of ccRCC. First, a threshold-based method was developed, which attained a maximum accuracy of 71.12% with an ROC of 0.67 using the single gene NR3C2. To improve the performance of the threshold-based method, we combined two or more genes and achieved a maximum accuracy of 70.19% with an ROC of 0.74 using eight genes on the validation dataset. These eight genes include four genes underexpressed (NR3C2, ENAM, DNASE1L3, FRMPD2) and four genes overexpressed (PLEKHA9, MAP6D1, SMPD4, C11orf73) in the late stage of ccRCC. Second, models were developed using state-of-the-art machine learning techniques and achieved a maximum accuracy of 72.64% and an ROC of 0.81 using 64 genes on the validation dataset. Similar accuracy was obtained with 38 genes selected from the subset of genes involved in cancer hallmark biological processes. Our analysis further implied a need to develop gender-specific models for stage classification. A web server, CancerCSP, has been developed to predict the stage of ccRCC using gene expression data derived from RNAseq experiments.
Year: 2017 PMID: 28349958 PMCID: PMC5368637 DOI: 10.1038/srep44997
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The performance of single gene-based threshold models developed using top overexpressed and under-expressed genes in early stage of ccRCC patients along with the brief description of molecular function and cancer hallmark biological process (Cancer hallmark GO term) associated with each gene.
| S. No. | Gene | Threshold | Accuracy (%) | ROC | Molecular Function | Cancer hallmark GO terms |
|---|---|---|---|---|---|---|
| 1 | NR3C2 | −0.48 | 71.12 | 0.67 | ACE Inhibitor Pathway, Aldosterone-regulated sodium reabsorption, transcription factor activity, sequence-specific DNA binding, steroid hormone receptor activity | — |
| 2 | | −0.04 | 66.83 | 0.67 | Transferase activity, poly(A) RNA-binding | — |
| 3 | | 0.06 | 65.39 | 0.66 | — | — |
| 4 | | 0.09 | 63.72 | 0.65 | 1-phosphatidylinositol binding | — |
| 5 | | −0.01 | 63.01 | 0.65 | Induces bone and cartilage development | — |
| 6 | | 0.18 | 63.48 | 0.65 | Nucleosome assembly | — |
| 7 | | −0.22 | 66.59 | 0.65 | Transcription factor activity, sequence-specific DNA binding, cation transmembrane transporter activity, ligand-dependent nuclear receptor binding | DNA repair |
| 8 | | 0.23 | 62.77 | 0.64 | — | — |
| 9 | | 0.27 | 61.58 | 0.64 | Cleaves chromatin DNA to nucleosomal units, endonuclease activity, calcium ion binding | — |
| 10 | | −0.1 | 65.63 | 0.64 | Involved in cell division | Cell cycle |
| 11 | | 0.48 | 69.93 | 0.65 | Glycolipid binding, glycolipid transporter activity | — |
| 12 | | 0.2 | 67.54 | 0.65 | Sphingolipid metabolic and catabolic process | — |
| 13 | | −0.19 | 66.35 | 0.65 | G-protein coupled receptor activity, bradykinin receptor binding, angiotensin type I and type II receptor activity | Cell motility, Response to external stimulus |
| 14 | | −0.37 | 61.58 | 0.65 | Mediates endoplasmic reticulum (ER) stress-induced apoptosis by activating CASP4 | — |
| 15 | | 0.16 | 66.11 | 0.65 | Crucial regulator of heart and vessel formation and integrity | Phosphorylation |
| 16 | | 0.05 | 66.35 | 0.65 | Calmodulin binding, microtubule binding | — |
| 17 | | −0.37 | 68.02 | 0.65 | Phosphatidylinositol-4,5-bisphosphate binding, phosphatidic acid binding | — |
| 18 | | 0.2 | 65.39 | 0.65 | Mediates endoplasmic reticulum (ER) stress-induced apoptosis, cysteine-type peptidase and endopeptidase activity | Apoptosis, Immune response |
| 19 | | 0.38 | 67.06 | 0.65 | Transcription factor activity, RNA polymerase II core promoter proximal region sequence-specific binding | Immune response, Response to external stimulus |
| 20 | | 0.32 | 66.35 | 0.64 | G-protein coupled receptor activity, PDZ domain binding, Stimulates phospholipase C | — |
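A single-gene threshold model of the kind summarized in this table can be sketched as follows. This is a minimal illustration, not the authors' implementation: the decision rule (predict late stage when expression falls below the cutoff for a gene underexpressed in late stage), the exhaustive scan over observed values, and the helper names `predict_late` and `best_threshold` are all assumptions, and the expression values are made up.

```python
def predict_late(expression, threshold, under_in_late=True):
    """Return True (late stage) when expression lies on the 'late' side of the cutoff.

    Assumption: for a gene underexpressed in late stage, values below the
    threshold are called late; for an overexpressed gene, values above it.
    """
    return expression < threshold if under_in_late else expression > threshold


def best_threshold(values, is_late, under_in_late=True):
    """Scan each observed value as a candidate cutoff; return (threshold, accuracy)."""
    best = (None, -1.0)
    for cand in sorted(set(values)):
        preds = [predict_late(v, cand, under_in_late) for v in values]
        acc = sum(p == y for p, y in zip(preds, is_late)) / len(values)
        if acc > best[1]:  # keep the first cutoff reaching the highest accuracy
            best = (cand, acc)
    return best


# Toy, made-up expression values for one hypothetical underexpressed gene
values = [-1.2, -0.9, -0.6, -0.3, 0.1, 0.4, 0.8]
is_late = [True, True, True, False, False, False, False]
threshold, accuracy = best_threshold(values, is_late)
```

On real data the cutoff would be chosen on the training set only and then applied unchanged to the validation set.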
The performance of classification models based on RCSP-set-Threshold (28 genes) developed using different machine learning techniques on training and independent or external validation dataset.
| Technique | Dataset | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|
| RF | Training | 73.62 | 72.12 | 73.03 | 0.45 | 0.77 |
| RF | Validation | 73.02 | 60.98 | 68.27 | 0.34 | 0.74 |
| Naive Bayes | Training | 75.98 | 67.27 | 72.55 | 0.43 | 0.76 |
| Naive Bayes | Validation | 77.78 | 60.98 | 71.15 | 0.39 | 0.76 |
| SMO | Training | 83.86 | 55.76 | 72.79 | 0.42 | 0.70 |
| SMO | Validation | 80.95 | 53.66 | 70.19 | 0.36 | 0.67 |
| J48 | Training | 64.17 | 66.06 | 64.92 | 0.3 | 0.67 |
| J48 | Validation | 68.25 | 58.54 | 64.42 | 0.26 | 0.67 |
| SVM | Training | 75.98 | 69.09 | 73.27 | 0.45 | 0.78 |
| SVM | Validation | 74.6 | 65.85 | 71.15 | 0.4 | 0.77 |
These RCSP-set-Threshold features are selected by the threshold-based approach followed by the removal of correlated features.
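Removing correlated features can be illustrated with a greedy filter: walk through the features in priority order and keep one only if it is not strongly correlated with any feature already kept. The source does not state the correlation measure or cutoff, so the use of Pearson correlation, the 0.8 cutoff, the helper names, and the toy values below are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, cutoff=0.8):
    """Greedy filter: keep a feature only if |r| < cutoff against every kept feature.

    `features` maps name -> list of values and is assumed to be ordered by priority.
    The 0.8 cutoff is a hypothetical value, not taken from the source.
    """
    kept = []
    for name, vals in features.items():
        if all(abs(pearson(vals, features[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Made-up expression profiles for three hypothetical genes
features = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * geneA, so it is dropped
    "geneC": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_correlated(features)
```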
Figure 1. The protein–protein interaction network among the potential ccRCC biomarkers generated using the STRING database (with direct and indirect interactions): (a) RCSP-set-Threshold, (b) RCSP-set-Weka, and (c) RCSP-set-Weka-Hall.
The performance of Support vector machine (SVM) and Random Forest (RF) based models developed using different sets of selected features on training and independent or external validation dataset.
| Features | Dataset | Technique | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|---|
| setA-1 (4 genes) | Training | SVM | 71.65 | 70.3 | 71.12 | 0.41 | 0.76 |
| setA-1 (4 genes) | Validation | SVM | 68.25 | 78.05 | 72.12 | 0.45 | 0.80 |
| setA-1 (4 genes) | Training | RF | 70.87 | 65.45 | 68.74 | 0.36 | 0.69 |
| setA-1 (4 genes) | Validation | RF | 73.02 | 58.54 | 67.31 | 0.32 | 0.74 |
| setB-1 (4 genes) | Training | SVM | 71.26 | 70.3 | 70.88 | 0.41 | 0.74 |
| setB-1 (4 genes) | Validation | SVM | 74.6 | 68.29 | 72.12 | 0.42 | 0.74 |
| setB-1 (4 genes) | Training | RF | 80.31 | 49.7 | 68.26 | 0.32 | 0.65 |
| setB-1 (4 genes) | Validation | RF | 82.54 | 51.22 | 70.19 | 0.36 | 0.68 |
| Combo-1 (8 genes) | Training | SVM | 75.20 | 70.30 | 73.27 | 0.45 | 0.77 |
| Combo-1 (8 genes) | Validation | SVM | 77.78 | 68.29 | 74.04 | 0.46 | 0.80 |
| Combo-1 (8 genes) | Training | RF | 81.1 | 55.15 | 70.88 | 0.38 | 0.73 |
| Combo-1 (8 genes) | Validation | RF | 82.54 | 51.22 | 70.19 | 0.36 | 0.74 |
These gene sets include setA-1 (4 overexpressed genes), setB-1 (4 under-expressed genes) and Combo-1 (combination of both gene sets i.e. setA-1 and setB-1).
The performance of Support vector machine (SVM) and Random Forest (RF) models developed using different sets of features selected via SVM technique on training and independent or external validation dataset.
| Features | Dataset | Technique | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|---|
| setA-2 (5 genes) | Training | SVM | 68.9 | 73.94 | 70.88 | 0.42 | 0.75 |
| setA-2 (5 genes) | Validation | SVM | 65.08 | 70.73 | 67.31 | 0.35 | 0.75 |
| setA-2 (5 genes) | Training | RF | 81.5 | 56.97 | 71.84 | 0.4 | 0.73 |
| setA-2 (5 genes) | Validation | RF | 77.78 | 46.34 | 65.38 | 0.25 | 0.68 |
| setB-2 (5 genes) | Training | SVM | 68.9 | 70.91 | 69.69 | 0.39 | 0.76 |
| setB-2 (5 genes) | Validation | SVM | 60.32 | 70.73 | 64.42 | 0.3 | 0.72 |
| setB-2 (5 genes) | Training | RF | 71.65 | 64.85 | 68.97 | 0.36 | 0.71 |
| setB-2 (5 genes) | Validation | RF | 69.84 | 56.1 | 64.42 | 0.26 | 0.70 |
| Combo-2 (10 genes) | Training | SVM | 72.44 | 72.89 | 72.62 | 0.45 | 0.78 |
| Combo-2 (10 genes) | Validation | SVM | 71.43 | 68.29 | 70.19 | 0.39 | 0.77 |
| Combo-2 (10 genes) | Training | RF | 76.19 | 65.85 | 72.12 | 0.42 | 0.76 |
| Combo-2 (10 genes) | Validation | RF | 70.47 | 70.3 | 70.41 | 0.4 | 0.76 |
Figure 2. A box plot diagram representing the median log expression distribution of 15 genes differentially expressed in the early and late stages of ccRCC with a p-value < 0.01 calculated using the Wilcoxon rank-sum test.
These genes are the union of Combo-1 and Combo-2 sets.
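The Wilcoxon rank-sum test mentioned in the caption compares two groups by the sum of ranks of one group in the pooled sample. The sketch below is a plain-Python version using the normal approximation with average ranks for ties and no tie correction of the variance; it is an illustration under those assumptions, not the authors' code, and the expression values are made up. In practice a library routine such as SciPy's `scipy.stats.ranksums` would normally be used.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (z, p) via the normal approximation."""
    # Pool both groups, remembering group membership (0 = x, 1 = y)
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return z, p


# Made-up log expression values for one hypothetical gene
early = [2.1, 2.5, 2.8, 3.0, 3.3, 3.6]
late = [1.0, 1.2, 1.5, 1.7, 1.9, 2.0]
z, p = rank_sum_test(early, late)
```

For samples this small an exact test would be preferable to the normal approximation; the sketch only shows the mechanics.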
The performance of models based on different machine learning techniques using RCSP-set-Weka (64 genes) selected by Weka.
| Technique | Dataset | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|
| RF | Training | 80.63 | 81.10 | 80.82 | 0.61 | 0.87 |
| RF | Validation | 78.12 | 64.29 | 72.64 | 0.43 | 0.75 |
| Naive Bayes | Training | 79.05 | 70.12 | 75.54 | 0.49 | 0.81 |
| Naive Bayes | Validation | 75.00 | 64.29 | 70.75 | 0.39 | 0.76 |
| SMO | Training | 84.19 | 70.12 | 78.66 | 0.55 | 0.77 |
| SMO | Validation | 79.69 | 66.67 | 74.53 | 0.47 | 0.73 |
| J48 | Training | 67.19 | 74.39 | 70.02 | 0.41 | 0.73 |
| J48 | Validation | 64.06 | 88.10 | 73.58 | 0.51 | 0.79 |
| SVM | Training | 79.84 | 75.61 | 78.18 | 0.55 | 0.83 |
| SVM | Validation | 73.44 | 71.43 | 72.64 | 0.44 | 0.81 |
These models were evaluated using 10-fold cross validation on training dataset as well as on independent or external validation dataset.
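The 10-fold cross-validation scheme can be sketched as an index generator: the training samples are shuffled, split into ten folds, and each fold serves once as the held-out set. This is a plain, unstratified version; whether the authors stratified by stage is not stated in the source, and the function name, seed, and sample count of 50 are illustrative assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducible folds
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 50 is an arbitrary toy sample count
splits = list(k_fold_indices(50, k=10))
```

A model would be fitted on each `train` list and scored on the matching `test` list, with the ten scores averaged to give the training-set performance.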
The performance of models developed using different machine learning techniques based on RCSP-set-Weka-Hall (38 genes) selected by Weka.
| Technique | Dataset | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|
| RF | Training | 75.10 | 78.66 | 76.50 | 0.53 | 0.84 |
| RF | Validation | 67.19 | 78.57 | 71.70 | 0.45 | 0.75 |
| Naive Bayes | Training | 79.84 | 71.34 | 76.50 | 0.51 | 0.83 |
| Naive Bayes | Validation | 75.00 | 66.67 | 71.70 | 0.41 | 0.79 |
| SMO | Training | 85.77 | 66.46 | 78.18 | 0.54 | 0.76 |
| SMO | Validation | 82.81 | 59.52 | 73.58 | 0.44 | 0.71 |
| J48 | Training | 71.54 | 61.59 | 67.63 | 0.33 | 0.69 |
| J48 | Validation | 68.75 | 71.43 | 69.81 | 0.39 | 0.68 |
| SVM | Training | 80.24 | 73.78 | 77.70 | 0.54 | 0.83 |
| SVM | Validation | 73.44 | 71.43 | 72.64 | 0.44 | 0.78 |
These genes are specifically involved in cancer hallmark biological processes (Cancer hallmark GO terms). The model was evaluated using 10-fold cross validation on training dataset as well as on independent external validation dataset.
The performance of gender-specific Support vector machine (SVM) and Random Forest (RF) models developed using Weka selected genes/features on training and independent or external validation dataset.
| Gender | Technique | Dataset | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | ROC |
|---|---|---|---|---|---|---|---|
| Female | RF | Training | 87.1 | 88.89 | 87.76 | 0.75 | 0.93 |
| Female | RF | Validation | 75 | 71.43 | 73.68 | 0.45 | 0.76 |
| Female | SVM | Training | 89.25 | 79.63 | 85.71 | 0.69 | 0.90 |
| Female | SVM | Validation | 75 | 85.71 | 78.95 | 0.59 | 0.82 |
| Male | RF | Training | 83.02 | 73.39 | 79.10 | 0.57 | 0.83 |
| Male | RF | Validation | 75.61 | 58.62 | 68.57 | 0.35 | 0.72 |
| Male | SVM | Training | 83.02 | 76.15 | 80.22 | 0.59 | 0.87 |
| Male | SVM | Validation | 78.05 | 75.86 | 77.14 | 0.53 | 0.80 |
Figure 3. The gene ontology analysis depicting the percentage distribution of different biomarkers in major biological processes, molecular functions and cellular components from the five gene sets.
In the process of gene enrichment, 56 out of 64 genes, 32 out of 38 genes, 26 out of 28 genes, 8 out of 10 genes and 7 out of 8 genes were annotated respectively.