Jian Liu, Yuhu Cheng, Xuesong Wang, Lin Zhang, Z Jane Wang.
Abstract
Identifying characteristic genes associated with specific biological processes in different cancers could provide insight into the underlying cancer genetics and into cancer prognostic assessment, so it is critically important to select such genes effectively. This paper proposes a novel unsupervised characteristic gene selection method based on sample learning and sparse filtering: Sample Learning based on Deep Sparse Filtering (SLDSF). With sample learning, SLDSF can better represent gene expression levels in the transformed sample space. Most unsupervised characteristic gene selection methods do not consider deep structures, although a multilayer structure may learn more meaningful representations than a single layer; deep sparse filtering is therefore used here to implement sample learning in SLDSF. Experimental studies on several microarray and RNA-Seq datasets demonstrate that SLDSF is more effective than several representative characteristic gene selection methods (e.g., RGNMF, GNMF, RPCA and PMD) for selecting cancer characteristic genes.
Year: 2018 PMID: 29844511 PMCID: PMC5974408 DOI: 10.1038/s41598-018-26666-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The differences between sample learning and feature learning. (a) A feature learning model for the lung cancer dataset. (b) A sample learning model for the lung cancer dataset.
Figure 2. The framework of sample learning with SLDSF on gene expression data.
The SLDSF algorithm.
| Input: Gene expression dataset: |
| Initialize |
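The algorithm listing is truncated in this record, so the following is only a minimal sketch of the forward pass and objective of sparse filtering as it is generally formulated (soft-absolute activation, L2 normalization of each learned unit across examples, L2 normalization of each example, then an L1 penalty), stacked into two layers. It assumes that sample learning means the weights act on the sample axis, so that each gene is re-expressed in a transformed sample space. The function names, layer sizes, and toy data are hypothetical, and the weights are random rather than optimized; this is not the authors' implementation.

```python
import math
import random


def soft_abs(x, eps=1e-8):
    """Smooth approximation of |x| that stays differentiable at zero."""
    return math.sqrt(x * x + eps)


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def sparse_filtering_layer(W, X):
    """Forward pass of one sparse filtering layer.

    W: k x n weights (k learned units, n input dimensions).
    X: n x m data (m examples stored as columns).
    Returns (F_hat, objective) with F_hat of shape k x m.
    """
    examples = list(zip(*X))
    # Linear map followed by the soft-absolute activation.
    F = [[soft_abs(sum(w * x for w, x in zip(unit, ex))) for ex in examples]
         for unit in W]
    # Normalize each learned unit (row) across all examples.
    F = [l2_normalize(row) for row in F]
    # Normalize each example (column) across all learned units.
    cols = [l2_normalize(col) for col in zip(*F)]
    F_hat = [list(row) for row in zip(*cols)]
    # Entries are non-negative, so the L1 penalty is a plain sum.
    objective = sum(sum(row) for row in F_hat)
    return F_hat, objective


def deep_sparse_filtering(weights, X):
    """Stack layers: the output of one layer is the input of the next."""
    out, objectives = X, []
    for W in weights:
        out, obj = sparse_filtering_layer(W, out)
        objectives.append(obj)
    return out, objectives


if __name__ == "__main__":
    rng = random.Random(0)
    n_samples, n_genes = 6, 10  # hypothetical toy sizes
    # Sample learning (assumed layout): samples are the input dimensions
    # (rows) and genes are the examples (columns).
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    weights = [
        [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(4)],
        [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)],
    ]
    genes_repr, objectives = deep_sparse_filtering(weights, X)
    print(len(genes_repr), len(genes_repr[0]))  # 3 transformed samples x 10 genes
```

In practice each layer's weights would be obtained by minimizing its objective with a gradient-based optimizer before the next layer is trained; that step is omitted here to keep the sketch short.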
Summary of gene expression datasets.
| Data type | Dataset | Classes represented | Number of genes | Number of samples | Number of classes |
|---|---|---|---|---|---|
| Microarray | Lung Cancer | Lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas cases, normal lung samples | 12600 | 203 | 5 |
| Microarray | Leukemia | Acute myelogenous leukemia, acute lymphoblastic leukemia | 5000 | 38 | 2 |
| Microarray | DLBCL | ‘Cured’ patients, ‘fatal/refractory’ patients | 7129 | 58 | 2 |
| RNA-Seq | ESCA | Diseased samples, normal samples | 20502 | 192 | 2 |
| RNA-Seq | HNSC | Diseased samples, normal samples | 20502 | 418 | 2 |
The P-Values of GO terms corresponding to different methods on the lung cancer dataset.
| ID | Name | SLDSF P-value | RGNMF P-value | GNMF P-value | RPCA P-value | PMD P-value |
|---|---|---|---|---|---|---|
| GO:0000184 | nuclear-transcribed mRNA catabolic process, nonsense-mediated decay | 2.16E-16 | 3.16E-16 | None | 5.24E-15 | |
| GO:0006614 | SRP-dependent cotranslational protein targeting to membrane | 2.77E-16 | 4.04E-16 | None | 6.58E-15 | |
| GO:0006613 | cotranslational protein targeting to membrane | 4.47E-16 | 6.53E-16 | None | 1.02E-14 | |
| GO:0045047 | protein targeting to ER | 7.09E-16 | 1.04E-15 | None | 1.56E-14 | |
| GO:0072599 | establishment of protein localization to endoplasmic reticulum | 9.91E-16 | 1.45E-15 | None | 2.12E-14 | |
| GO:0070972 | protein localization to endoplasmic reticulum | 5.15E-15 | 7.50E-15 | None | 9.63E-14 | |
| GO:0019080 | viral gene expression | 3.47E-14 | 5.19E-14 | None | 4.49E-13 | |
| GO:0044033 | multi-organism metabolic process | 6.77E-14 | 1.01E-13 | None | 1.01E-13 | |
| GO:0019083 | viral transcription | 3.91E-13 | 5.66E-13 | None | 5.14E-12 | |
| GO:0006415 | translational termination | 5.94E-15 | 8.91E-15 | None | 8.79E-14 |
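
The record does not say which tool or statistical test produced these P-values. As a point of reference only, over-representation of a GO term in a selected gene list is commonly scored with a one-sided hypergeometric test, and the sketch below computes that upper-tail probability. The function name and the counts in the usage lines are hypothetical and unrelated to the values in the table.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population: number of genes in the background set.
    annotated:  background genes annotated with the GO term.
    selected:   number of genes picked by the selection method.
    overlap:    selected genes that carry the annotation.
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    # Count the ways to draw at least `overlap` annotated genes.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    # Exact integer arithmetic first, so very small P-values do not underflow early.
    return float(Fraction(tail, total))


if __name__ == "__main__":
    # Toy numbers: all 5 selected genes fall inside a 5-gene term out of 10 genes.
    print(enrichment_p_value(population=10, annotated=5, selected=5, overlap=5))  # 1/252
```

A smaller P-value means the overlap is less likely to arise by chance, which is how the columns in these tables are usually read.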
The P-Values of GO terms corresponding to different methods on the leukemia dataset.
| ID | Name | SLDSF P-value | RGNMF P-value | GNMF P-value | RPCA P-value | PMD P-value |
|---|---|---|---|---|---|---|
| GO:0006955 | immune response | 4.14E-12 | 2.76E-11 | 3.45E-15 | 1.83E-11 | |
| GO:0001775 | cell activation | 8.94E-18 | 1.40E-14 | 1.35E-13 | 8.60E-13 | |
| GO:0045321 | leukocyte activation | 5.89E-13 | 5.34E-11 | 4.72E-16 | 4.01E-11 | |
| GO:0007159 | leukocyte cell-cell adhesion | 3.56E-13 | 4.58E-15 | 6.05E-14 | 4.07E-11 | |
| GO:0046649 | lymphocyte activation | 3.13E-12 | 2.63E-09 | 2.95E-15 | 2.43E-11 | |
| GO:0016337 | single organismal cell-cell adhesion | 2.86E-12 | 2.02E-09 | 4.44E-12 | 2.10E-12 | |
| GO:0034109 | homotypic cell-cell adhesion | 1.05E-12 | 1.34E-09 | 1.26E-14 | 1.05E-10 | |
| GO:0070486 | leukocyte aggregation | 1.60E-12 | 2.40E-09 | 2.00E-14 | 1.82E-10 | |
| GO:0098602 | single organism cell adhesion | 1.01E-12 | 7.14E-10 | 1.42E-11 | 7.25E-13 | |
| GO:0050776 | regulation of immune response | 7.66E-11 | 4.01E-09 | 1.13E-12 | 5.59E-11 |
The P-Values of GO terms corresponding to different methods on the DLBCL dataset.
| ID | Name | SLDSF P-value | RGNMF P-value | GNMF P-value | RPCA P-value | PMD P-value |
|---|---|---|---|---|---|---|
| GO:0006614 | SRP-dependent cotranslational protein targeting to membrane | 4.29E-90 | 3.66E-91 | 1.94E-35 | 2.65E-92 | |
| GO:0006613 | cotranslational protein targeting to membrane | 1.23E-89 | 1.05E-90 | 3.03E-35 | 7.62E-92 | |
| GO:0045047 | protein targeting to ER | 9.48E-89 | 8.10E-90 | 7.19E-35 | 5.87E-91 | |
| GO:0072599 | establishment of protein localization to endoplasmic reticulum | 6.65E-88 | 5.69E-89 | 1.65E-34 | 4.12E-90 | |
| GO:0000184 | nuclear-transcribed mRNA catabolic process, nonsense-mediated decay | 2.72E-87 | 2.32E-88 | 2.46E-36 | 1.68E-89 | |
| GO:0070972 | protein localization to endoplasmic reticulum | 2.51E-84 | 2.15E-85 | 5.78E-33 | 1.56E-86 | |
| GO:0006414 | translational elongation | 1.84E-79 | 1.26E-80 | 2.02E-30 | 1.57E-80 | |
| GO:0006415 | translational termination | 2.51E-78 | 2.16E-79 | 2.61E-30 | 2.80E-80 | |
| GO:0019080 | viral gene expression | 5.62E-78 | 4.33E-79 | 7.12E-31 | 2.67E-79 | |
| GO:0044033 | multi-organism metabolic process | 6.82E-77 | 5.27E-78 | 3.02E-30 | 3.40E-79 |
Figure 3. Venn diagram of genes selected by five methods on (a) the lung cancer dataset, (b) the leukemia dataset and (c) the DLBCL dataset.
The P-Values of GO terms corresponding to different methods on the ESCA dataset.
| ID | Name | SLDSF P-value | RPCA P-value | PMD P-value |
|---|---|---|---|---|
| GO:0042060 | wound healing | 8.20E-13 | 7.56E-12 | |
| GO:0009611 | response to wounding | 4.01E-10 | 4.01E-10 | |
| GO:0022610 | biological adhesion | 2.01E-12 | 3.37E-13 | |
| GO:0006955 | immune response | 9.95E-11 | 9.95E-11 | |
| GO:0007155 | cell adhesion | 9.34E-12 | 1.63E-12 | |
| GO:0043588 | skin development | None | ||
| GO:0007010 | cytoskeleton organization | 1.39E-08 | ||
| GO:0050776 | regulation of immune response | 6.12E-10 | 3.70E-09 | |
| GO:0034109 | homotypic cell-cell adhesion | 1.59E-08 | ||
| GO:0098609 | cell-cell adhesion | 3.04E-09 | 3.04E-09 |
The P-Values of GO terms corresponding to different methods on the HNSC dataset.
| ID | Name | SLDSF P-value | RPCA P-value | PMD P-value |
|---|---|---|---|---|
| GO:0042060 | wound healing | 5.38E-11 | 1.69E-11 | |
| GO:0031581 | hemidesmosome assembly | 2.27E-09 | None | |
| GO:0009611 | response to wounding | 1.09E-08 | 2.88E-08 | |
| GO:0022610 | biological adhesion | 5.73E-09 | 9.48E-10 | |
| GO:0034330 | cell junction organization | 5.69E-10 | 1.25E-07 | |
| GO:0043588 | skin development | 1.24E-11 | 7.65E-18 | |
| GO:0007010 | cytoskeleton organization | 2.56E-07 | 6.43E-07 | |
| GO:0034329 | cell junction assembly | 5.69E-10 | 1.19E-06 | |
| GO:0045104 | intermediate filament cytoskeleton organization | 6.83E-11 | 8.77E-11 | |
| GO:0007155 | cell adhesion | 2.21E-08 | 7.91E-10 |
Figure 4. Venn diagram of genes selected by three methods on (a) the ESCA dataset and (b) the HNSC dataset.
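
The regions of a Venn diagram like the ones in these figures can be tabulated directly from the per-method gene lists. The sketch below is one simple way to do so, assuming each method's selection is available as a set of gene identifiers; the function name and the placeholder identifiers (`g1`, `g2`, ...) are hypothetical and are not genes reported in the paper.

```python
from itertools import combinations


def venn_regions(selected):
    """Map each combination of methods to the genes chosen by exactly those methods.

    selected: dict of method name -> set of gene identifiers.
    Empty regions are omitted from the result.
    """
    methods = sorted(selected)
    regions = {}
    for size in range(1, len(methods) + 1):
        for combo in combinations(methods, size):
            # Genes picked by every method in this combination...
            inside = set.intersection(*(selected[m] for m in combo))
            # ...minus genes picked by any method outside it.
            outside = set().union(*(selected[m] for m in methods if m not in combo))
            exclusive = inside - outside
            if exclusive:
                regions[combo] = exclusive
    return regions


if __name__ == "__main__":
    toy = {
        "SLDSF": {"g1", "g2", "g3"},
        "RPCA": {"g2", "g3", "g4"},
        "PMD": {"g3", "g5"},
    }
    for combo, genes in venn_regions(toy).items():
        print(" & ".join(combo), sorted(genes))
```

Because every gene lands in exactly one region, the region sizes sum to the number of distinct genes selected by any method, which is a quick consistency check on a drawn diagram.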