Mei Sze Tan, Siow-Wee Chang, Phaik Leng Cheah, Hwa Jen Yap.
Abstract
Although most cervical cancer cases are reported to be closely related to Human Papillomavirus (HPV) infection, there is a need to study the genes that are differentially expressed when cervical cancer finally develops following HPV infection. In this study, we propose an integrative machine learning approach that analyses multiple gene expression profiles in cervical cancer in order to identify a set of genetic markers that are associated with, and may eventually aid in, the diagnosis or prognosis of cervical cancer. The proposed integrative analysis is composed of three steps: (i) gene expression analysis of individual datasets; (ii) meta-analysis of multiple datasets; and (iii) feature selection and machine learning analysis. As a result, 21 genes were identified through the integrative machine learning analysis, which included seven supervised methods and one unsupervised method. A functional analysis with GSEA (Gene Set Enrichment Analysis) was performed on the selected 21-gene set and showed significant enrichment for a potential nine-gene expression signature, namely PEG3, SPON1, BTD and RPLP2 (upregulated genes) and PRDX3, COPB2, LSM3, SLC5A3 and ASF1B (downregulated genes).
Keywords: Cervical cancer prognosis; Feature selection; Gene expression profiling; Machine learning; Meta-analysis; Potential gene signature
Year: 2018 PMID: 30065881 PMCID: PMC6064203 DOI: 10.7717/peerj.5285
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Details of the four microarray datasets used in this research.
| Study | Platform | Samples | Genes | Variables |
|---|---|---|---|---|
| | HG-U133_Plus_2 | 39 | 54,675 | Lymph node status: negative, positive{19} |
| | HG-U133_Plus_2 | 33 | 54,675 | Biopsy: 1{5}, 2{6}, 3{7}, 4{5}, 5{5}, 6{3}, 7{2} |
| | HG-U133A | 61 | 22,283 | tissue: normal{24}, cell line{5}, cancer{28} |
| | HG-U133_Plus_2 | 130 | 54,675 | tissue: normal{24}, CIN1 lesions{14}, CIN2 lesions{22}, CIN3 lesions{40}, cancer{28} |
Figure 1. Framework for the proposed integrative approach of meta-analysis and machine learning in gene expression profiling of cervical cancer.
*DE: differential expression value
Results of RankProd analysis: 32 upregulated genes.
| ID_REF | Identifier | gene.index | RP/Rsum | FC: (class1/class2) | pfp | P-value |
|---|---|---|---|---|---|---|
| 206731_at | CNKSR2 | 102 | 2.697 | 1.13 | 7.77E−155 | 6.64E−157 |
| 209243_s_at | PEG3 | 107 | 2.846 | 1.13 | 1.38E−149 | 2.36E−151 |
| 210549_s_at | CCL23 | 110 | 4.156 | 1.19 | 3.15E−114 | 8.08E−116 |
| 216108_at | LOC105373738 | 22 | 4.363 | 1.22 | 3.41E−110 | 1.17E−111 |
| 220298_s_at | SPATA6 | 37 | 6.35 | 1.29 | 6.39E−81 | 2.73E−82 |
| 212797_at | SORT1 | 20 | 6.371 | 1.33 | 9.09E−81 | 4.66E−82 |
| 207996_s_at | LDLRAD4 | 71 | 10.32 | 1.66 | 1.72E−49 | 1.03E−50 |
| 208762_at | SUMO1 | 59 | 11.02 | 1.66 | 8.37E−46 | 5.72E−47 |
| 207257_at | EPO | 84 | 12.25 | 1.48 | 4.26E−40 | 3.28E−41 |
| 202035_s_at | SFRP1 | 73 | 12.53 | 1.50 | 5.99E−39 | 5.12E−40 |
| 204672_s_at | ANKRD6 | 86 | 13.23 | 1.57 | 3.28E−36 | 3.09E−37 |
| 220994_s_at | STXBP6 | 62 | 13.62 | 1.57 | 9.11E−35 | 9.34E−36 |
| 213652_at | PCSK5 | 88 | 14.12 | 1.67 | 4.76E−33 | 5.29E−34 |
| 213994_s_at | SPON1 | 50 | 14.35 | 1.61 | 2.72E−32 | 3.25E−33 |
| 221606_s_at | HMGN5 | 35 | 15.39 | 1.68 | 5.13E−29 | 6.57E−30 |
| 203440_at | CDH2 | 38 | 15.74 | 1.65 | 5.24E−28 | 7.17E−29 |
| 207046_at | HIST2H4B | 33 | 16.79 | 1.66 | 3.64E−25 | 5.29E−26 |
| 211865_s_at | FZR1 | 99 | 19.24 | 1.69 | 1.47E−19 | 2.26E−20 |
| 215401_at | AU147698 | 17 | 19.79 | 1.74 | 1.72E−18 | 2.79E−19 |
| 213068_at | DPT | 4 | 20.42 | 1.75 | 2.48E−17 | 4.24E−18 |
| 214117_s_at | BTD | 34 | 22.69 | 1.80 | 1.28E−13 | 2.30E−14 |
| 34187_at | AK026407 | 100 | 23.65 | 1.83 | 2.78E−12 | 5.24E−13 |
| 213611_at | AQP5 | 82 | 25.09 | 1.86 | 1.82E−10 | 3.58E−11 |
| 209465_x_at | PTN | 53 | 26.18 | 2.00 | 3.00E−09 | 6.16E−10 |
| 208606_s_at | WNT4 | 13 | 27.42 | 1.92 | 5.31E−08 | 1.14E−08 |
| 211814_s_at | CCNE2 | 52 | 33.28 | 2.00 | 0.001144 | 0.0002542 |
| 208790_s_at | PTRF | 81 | 33.49 | 2.00 | 0.00144 | 0.0003322 |
| 200908_s_at | RPLP2 | 40 | 33.97 | 2.04 | 0.002479 | 0.0005933 |
| 219795_at | SLC6A14 | 57 | 34.09 | 2.22 | 0.002761 | 0.0006842 |
| 205730_s_at | ABLIM3 | 64 | 35.48 | 2.05 | 0.01187 | 0.003044 |
| 202648_at | TCF3 | 32 | 37.13 | 2.10 | 0.05047 | 0.01337 |
| 216267_s_at | TMEM115 | 95 | 37.72 | 2.07 | 0.07745 | 0.02118 |
Results of RankProd analysis: 33 downregulated genes.
| ID_REF | Identifier | gene.index | RP/Rsum | FC: (class1/class2) | pfp | P-value |
|---|---|---|---|---|---|---|
| 221475_s_at | RPL15 | 104 | 3.889 | 0.67 | 7.27E−120 | 1.24E−121 |
| 203535_at | S100A9 | 103 | 4.437 | 0.71 | 1.20E−108 | 3.09E−110 |
| 209719_x_at | SERPINB3 | 3 | 6.49 | 0.46 | 2.94E−79 | 1.00E−80 |
| 200074_s_at | GUK1 | 55 | 7.095 | 0.45 | 4.17E−73 | 1.78E−74 |
| 209351_at | KRT14 | 26 | 7.505 | 0.393 | 2.33E−69 | 1.20E−70 |
| 201097_s_at | ARF4 | 11 | 8.908 | 0.32 | 2.43E−58 | 1.46E−59 |
| 217845_x_at | HIGD1A | 2 | 10.75 | 0.247 | 3.25E−47 | 2.22E−48 |
| 210413_x_at | SERPINB4 | 7 | 10.93 | 0.27 | 2.59E−46 | 1.99E−47 |
| 210835_s_at | CTBP2 | 77 | 12.68 | 0.17 | 2.39E−38 | 2.05E−39 |
| 209720_s_at | SERPINB3 | 1 | 13.56 | 0.13 | 5.96E−35 | 5.60E−36 |
| 200761_s_at | ARL6IP5 | 65 | 13.75 | 0.13 | 2.67E−34 | 2.74E−35 |
| 201619_at | PRDX3 | 69 | 14.18 | 0.10 | 7.82E−33 | 8.69E−34 |
| 206276_at | LY6D | 47 | 14.37 | 0.15 | 3.27E−32 | 3.91E−33 |
| 201098_at | COPB2 | 42 | 15.65 | 0.07 | 3.17E−28 | 4.07E−29 |
| 201653_at | CNIH1 | 87 | 16.31 | 0.067 | 2.05E−26 | 2.80E−27 |
| 211906_s_at | SERPINB4 | 8 | 16.42 | 0.003 | 3.80E−26 | 5.52E−27 |
| 202753_at | PSMD6 | 56 | 17.6 | 0.01 | 3.54E−23 | 5.45E−24 |
| 202209_at | LSM3 | 15 | 17.92 | 0.004 | 1.97E−22 | 3.20E−23 |
| 211023_at | PDHB | 9 | 17.95 | 0.018 | 2.14E−22 | 3.66E−23 |
| 221896_s_at | HIGD1A | 5 | 19.47 | 954.6 | 3.69E−19 | 6.62E−20 |
| 213164_at | SLC5A3 | 63 | 22.18 | 787.9 | 2.11E−14 | 3.97E−15 |
| 201863_at | FAM32A | 106 | 22.74 | 767.2 | 1.42E−13 | 2.78E−14 |
| 209694_at | PTS | 66 | 24.92 | 688.4 | 1.09E−10 | 2.23E−11 |
| 218845_at | DUSP22 | 10 | 26.22 | 675.4 | 3.21E−09 | 6.86E−10 |
| 203315_at | NCK2 | 80 | 27.84 | 584.5 | 1.28E−07 | 2.84E−08 |
| 213357_at | GTF2H5 | 114 | 28.09 | 612.8 | 2.09E−07 | 4.83E−08 |
| 217850_at | SNORD19B | 29 | 28.27 | 592.6 | 2.95E−07 | 7.06E−08 |
| 218283_at | SS18L2 | 12 | 28.72 | 578.6 | 6.99E−07 | 1.73E−07 |
| 203282_at | GBE1 | 14 | 31.25 | 526.7 | 5.77E−05 | 1.48E−05 |
| 212488_at | COL5A1 | 68 | 32.86 | 426.3 | 0.0005592 | 0.000148 |
| 219389_at | SUSD4 | 48 | 34.36 | 439.2 | 0.003389 | 0.000927 |
| 218115_at | ASF1B | 36 | 35.81 | 405 | 0.00481 | 0.004177 |
| 201692_at | SIGMAR1 | 117 | 36.53 | 393 | 0.00775 | 0.008063 |
Notes.
Description of Tables 2 and 3: gene.index = index of the gene; RP/Rsum = computed rank product statistic; FC: (class1/class2) = fold change of the average expression levels between the two conditions (upregulated class and downregulated class); pfp = estimated false-positive prediction value of the gene; P-value = estimated p-value of the gene.
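To make the RP/Rsum column more concrete, the following is a minimal sketch of a rank product statistic: each gene is ranked within every comparison, and the geometric mean of its ranks is taken. It is a simplified illustration, not the RankProd package itself (which also estimates pfp and p-values by permutation); the function name `rank_product` and the toy fold changes are hypothetical.

```python
import math

def rank_product(columns):
    """Geometric mean of each gene's ranks across comparisons.

    columns: list of comparisons, each a list with one fold change per gene.
    Rank 1 is assigned to the largest fold change (most upregulated).
    """
    n_genes = len(columns[0])
    ranks_per_gene = [[] for _ in range(n_genes)]
    for values in columns:
        # Order gene indices from largest to smallest fold change.
        order = sorted(range(n_genes), key=lambda g: values[g], reverse=True)
        for rank, gene in enumerate(order, start=1):
            ranks_per_gene[gene].append(rank)
    return [math.prod(r) ** (1 / len(r)) for r in ranks_per_gene]

# Hypothetical log fold changes: 3 comparisons x 4 genes (not data from the study).
toy_columns = [
    [2.0, 0.1, 1.0, -1.5],
    [1.8, 0.3, 2.5, -1.0],
    [2.2, -0.2, 0.9, -2.0],
]
rp = rank_product(toy_columns)
most_upregulated = min(range(len(rp)), key=lambda g: rp[g])
```

A small rank product means the gene sits near the top of the list in every comparison, which is why the tables are sorted by increasing RP/Rsum.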
Figure 2. Hierarchical clustering of the selected genes.
(A) Hierarchical clustering of the 32 upregulated genes. (B) Hierarchical clustering of the 33 downregulated genes. In each panel, rows represent genes and columns represent samples. The dendrogram at the side shows the relationship between the gene expression patterns, while the dendrogram at the top shows the relationship between the samples. The color key indicates the expression level of each gene relative to its mean across all samples, with green representing higher expression. The color bar marks the gene clusters obtained after the tree has been cut at 1.5 of its height, which makes the clustering easier to see; its colors follow the same color key.
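The clustering and tree cut described in this caption can be sketched with SciPy. The sketch reads "cut at 1.5" as a distance threshold of 1.5 on the dendrogram, and it uses average linkage with Euclidean distance; both are assumptions, since the caption does not state them. The expression matrix is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic data (not from the study): 6 genes x 10 samples.
# Three genes are high in the first five samples, three in the last five.
up_first = np.array([1.0] * 5 + [-1.0] * 5)
up_last = -up_first
expr = np.vstack(
    [pattern + rng.normal(0.0, 0.1, size=10)
     for pattern in (up_first,) * 3 + (up_last,) * 3]
)

# Express each gene relative to its own mean across samples.
centred = expr - expr.mean(axis=1, keepdims=True)

# Build the gene dendrogram, then cut it at a height of 1.5.
tree = linkage(centred, method="average", metric="euclidean")
gene_clusters = fcluster(tree, t=1.5, criterion="distance")
```

Each entry of `gene_clusters` is the cluster label of one gene, which is the information a color bar next to the heatmap would encode.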
Ranking of each upregulated gene using the proposed feature selection (FS) methods.
| ID_REF | Identifier | HC | PCC | Relief-F | SFS | SVM-RFE | CFS | RF | IG | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 206731_at | CNKSR2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
| 209243_s_at | PEG3 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
| 210549_s_at | CCL23 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 5 |
| 216108_at | LOC105373738 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 5 |
| 220298_s_at | SPATA6 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
| 212797_at | SORT1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 4 |
| 207996_s_at | LDLRAD4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 |
| 208762_at | SUMO1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 3 |
| 207257_at | EPO | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 5 |
| 202035_s_at | SFRP1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 4 |
| 204672_s_at | ANKRD6 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 5 |
| 220994_s_at | STXBP6 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 5 |
| 213652_at | PCSK5 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 3 |
| 213994_s_at | SPON1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 6 |
| 221606_s_at | HMGN5 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 203440_at | CDH2 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 5 |
| 207046_at | HIST2H4B | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 211865_s_at | FZR1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 5 |
| 215401_at | AU147698 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 5 |
| 213068_at | DPT | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 5 | |
| 214117_s_at | BTD | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 7 |
| 34187_at | AK026407 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 6 |
| 213611_at | AQP5 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 5 |
| 209465_x_at | PTN | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 3 |
| 208606_s_at | WNT4 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 211814_s_at | CCNE2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 208790_s_at | PTRF | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 200908_s_at | RPLP2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
| 219795_at | SLC6A14 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| 205730_s_at | ABLIM3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 5 |
| 202648_at | TCF3 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 7 |
| 216267_s_at | TMEM115 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 6 |
Notes.
The nine upregulated genes selected by most (≥6) of the methods are CNKSR2, PEG3, SPATA6, SPON1, BTD, AK026407, RPLP2, TCF3 and TMEM115.
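The Score column is the number of feature selection methods that picked a gene, and a gene is kept when at least six of the eight methods agree. A minimal sketch of that vote count, using six rows copied from the table above (the variable names are illustrative):

```python
METHODS = ["HC", "PCC", "Relief-F", "SFS", "SVM-RFE", "CFS", "RF", "IG"]
MIN_VOTES = 6  # threshold stated in the table note (>= 6 of 8 methods)

# 1 = gene selected by the method, 0 = not selected; column order follows METHODS.
votes = {
    "CNKSR2":   [1, 1, 1, 1, 1, 1, 1, 1],
    "PEG3":     [1, 0, 1, 1, 1, 1, 1, 1],
    "CCL23":    [1, 1, 1, 1, 0, 0, 0, 1],
    "SORT1":    [0, 1, 0, 1, 0, 0, 1, 1],
    "SPON1":    [1, 0, 1, 1, 1, 1, 1, 0],
    "HIST2H4B": [1, 0, 0, 0, 0, 0, 0, 0],
}

# Score = number of methods that selected the gene.
scores = {gene: sum(flags) for gene, flags in votes.items()}
selected = [gene for gene, score in scores.items() if score >= MIN_VOTES]
```

With these six rows, `selected` contains CNKSR2, PEG3 and SPON1, in agreement with the note above.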
Ranking of each downregulated gene using the proposed feature selection (FS) methods.
| ID_REF | Identifier | HC | PCC | Relief-F | SFS | SVM-RFE | CFS | RF | IG | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 221475_s_at | RPL15 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
| 203535_at | S100A9 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 4 |
| 209719_x_at | SERPINB3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| 200074_s_at | GUK1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3 |
| 209351_at | KRT14 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
| 201097_s_at | ARF4 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 4 |
| 217845_x_at | HIGD1A | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
| 210413_x_at | SERPINB4 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 4 |
| 210835_s_at | CTBP2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 4 |
| 209720_s_at | SERPINB3 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 3 |
| 200761_s_at | ARL6IP5 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 6 |
| 201619_at | PRDX3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 7 |
| 206276_at | LY6D | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 3 |
| 201098_at | COPB2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||
| 201653_at | CNIH1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 7 |
| 211906_s_at | SERPINB4 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 3 |
| 202753_at | PSMD6 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 6 |
| 202209_at | LSM3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 6 |
| 211023_at | PDHB | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
| 221896_s_at | HIGD1A | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 6 |
| 213164_at | SLC5A3 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 6 |
| 201863_at | FAM32A | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | |
| 209694_at | PTS | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 6 |
| 218845_at | DUSP22 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
| 203315_at | NCK2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 5 |
| 213357_at | GTF2H5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 4 |
| 217850_at | SNORD19B | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 5 |
| 218283_at | SS18L2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| 203282_at | GBE1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
| 212488_at | COL5A1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | |
| 219389_at | SUSD4 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 4 |
| 218115_at | ASF1B | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
| 201692_at | SIGMAR1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 5 |
Notes.
The 12 downregulated genes selected by most (≥6) of the methods are RPL15, HIGD1A, ARL6IP5, PRDX3, COPB2, CNIH1, PSMD6, LSM3, PDHB, SLC5A3, PTS and ASF1B. As gene HIGD1A was selected twice under different ID_REF values, it is counted only once here.
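Because one gene can be measured by more than one probe set, the selected probes have to be collapsed to unique gene symbols, as the note does for HIGD1A. A minimal sketch using five rows copied from the table above; the first-seen ordering is an illustrative choice, not something the source specifies:

```python
MIN_VOTES = 6  # threshold stated in the table note

# (ID_REF, gene symbol, votes in the order HC, PCC, Relief-F, SFS, SVM-RFE, CFS, RF, IG)
rows = [
    ("221475_s_at", "RPL15",   [1, 1, 1, 0, 1, 1, 1, 1]),
    ("203535_at",   "S100A9",  [0, 0, 0, 1, 1, 1, 1, 0]),
    ("217845_x_at", "HIGD1A",  [1, 1, 1, 1, 1, 0, 1, 1]),
    ("200761_s_at", "ARL6IP5", [1, 1, 1, 1, 1, 0, 1, 0]),
    ("221896_s_at", "HIGD1A",  [1, 1, 1, 1, 1, 0, 1, 0]),
]

# Keep probes that reach the vote threshold, then collapse to gene symbols.
passing_symbols = [symbol for _, symbol, flags in rows if sum(flags) >= MIN_VOTES]
unique_genes = list(dict.fromkeys(passing_symbols))  # drops repeats, keeps first-seen order
```

Both HIGD1A probes pass the threshold, but the gene appears only once in `unique_genes`.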
Results of GSEA analysis using the nine upregulated genes.
| Identifier | Highest normalized enrichment score (NES) | FDR | Status |
|---|---|---|---|
| CNKSR2 | 1.00 | 0.859 | Rejected |
| PEG3 | 1.33 | 0.000 | |
| SPATA6 | 1.00 | 0.823 | Rejected |
| SPON1 | 1.60 | 0.053 | |
| BTD | 1.50 | 0.026 | |
| AK026407 | – | – | – |
| RPLP2 | 1.40 | 0.000 | |
| TCF3 | – | – | – |
| TMEM115 | – | – | – |
Notes.
Genes AK026407, TCF3 and TMEM115 had no match in MSigDB.
Results of GSEA analysis using the 12 downregulated genes.
| Identifier | Highest normalized enrichment score (NES) | FDR | Status |
|---|---|---|---|
| RPL15 | 1.21 | 0.779 | Rejected |
| HIGD1A | 1.25 | 1.00 | Rejected |
| ARL6IP5 | – | – | – |
| PRDX3 | 1.41 | 0.020 | |
| COPB2 | 1.41 | 0.020 | |
| CNIH1 | – | – | – |
| PSMD6 | 1.44 | 0.560 | Rejected |
| LSM3 | 1.41 | 0.020 | |
| PDHB | 1.33 | 0.839 | Rejected |
| SLC5A3 | 1.50 | 0.000 | |
| PTS | – | – | – |
| ASF1B | 1.43 | 0.000 | |
Notes.
Genes ARL6IP5, CNIH1 and PTS had no match in MSigDB.
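The step from the 21 selected genes to the nine-gene signature can be sketched as an FDR filter over the two GSEA tables. Two assumptions are made here: the table rows are paired with genes in the order given in the feature-selection notes, and the cutoff is FDR < 0.25, a conventional GSEA threshold that the tables themselves do not state. Genes without an MSigDB match are stored as `None`.

```python
FDR_CUTOFF = 0.25  # assumption: conventional GSEA cutoff, not stated in the source

# gene -> (highest NES, FDR), or None when the gene had no match in MSigDB
gsea_up = {
    "CNKSR2": (1.00, 0.859), "PEG3": (1.33, 0.000), "SPATA6": (1.00, 0.823),
    "SPON1": (1.60, 0.053), "BTD": (1.50, 0.026), "AK026407": None,
    "RPLP2": (1.40, 0.000), "TCF3": None, "TMEM115": None,
}
gsea_down = {
    "RPL15": (1.21, 0.779), "HIGD1A": (1.25, 1.00), "ARL6IP5": None,
    "PRDX3": (1.41, 0.020), "COPB2": (1.41, 0.020), "CNIH1": None,
    "PSMD6": (1.44, 0.560), "LSM3": (1.41, 0.020), "PDHB": (1.33, 0.839),
    "SLC5A3": (1.50, 0.000), "PTS": None, "ASF1B": (1.43, 0.000),
}

def enriched(results, cutoff=FDR_CUTOFF):
    """Return genes with a GSEA result whose FDR is below the cutoff."""
    return [gene for gene, res in results.items() if res is not None and res[1] < cutoff]

signature_up = enriched(gsea_up)
signature_down = enriched(gsea_down)
```

Under these assumptions the filter returns PEG3, SPON1, BTD and RPLP2 for the upregulated set and PRDX3, COPB2, LSM3, SLC5A3 and ASF1B for the downregulated set, the same nine genes named in the abstract.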
Figure 3. Pathway network of the upregulated genes.
Figure 4. Pathway network of the downregulated genes.
Function, related diseases and description of the most significant genes (upregulated).
| Genes | Full name | Function | Related diseases and description |
|---|---|---|---|
| PEG3 | Paternally Expressed Gene 3 | • Protein-coding gene. | • Hypermethylation of |
| SPON1 | F-spondin 1 | • Regulation of extracellular matrix organization, interaction between cells and axon guidance ( | • Showed extreme expression activities in the identification of colorectal biomarkers ( |
| BTD | Biotinidase | • Gene expression, proliferation and differentiation of cells, gene signaling ( | • Gene expression signature that marks pelvic lymph node metastasis (PLNM) in cervical carcinoma ( |
| RPLP2 | Ribosomal Protein Lateral Stalk Subunit P2 | • Encodes a 60S subunit ribosomal protein. | • Prognostic marker for gastric cancer ( |
Function, related diseases and description of the most significant genes (downregulated).
| Genes | Full name | Function | Related diseases and description |
|---|---|---|---|
| PRDX3 | Peroxiredoxin 3 | • Encodes a protein with antioxidant function that protects mitochondria from oxidative stress ( | • Showed correlation with the severity of cervical carcinoma ( |
| COPB2 | Coatomer Protein Complex Subunit Beta 2 | • Protein-coding gene. | • One of the differentially expressed genes in LNCaP prostate cancer cell lines ( |
| LSM3 | Hypothetical protein LOC285378 | • Involved in mRNA splicing | • Involved in the rapid proliferation, invasiveness, oxidative phosphorylation and tumor size of cervical cancers ( |
| SLC5A3 | Solute Carrier Family 5 Member 3 | • Involved in the osmoregulation of cells ( | • Differentially expressed between parental SiHa cells and SiHa/R cells of cervical cancer ( |
| ASF1B | Anti-Silencing Function 1B Histone Chaperone | • Encodes a substrate protein of a cell cycle-regulated kinase. | • Associated with the aggressiveness of breast tumors ( |