Xiaoshu Zhu, Hong-Dong Li, Yunpei Xu, Lilu Guo, Fang-Xiang Wu, Guihua Duan, Jianxin Wang.
Abstract
Single-cell RNA sequencing (scRNA-seq) has recently brought new insight into cell differentiation processes and functional variation in cell subtypes from homogeneous cell populations. A lack of prior knowledge makes unsupervised machine learning methods, such as clustering, suitable for analyzing scRNA-seq data. However, there are several limitations to overcome, including high dimensionality, clustering result instability, and parameter adjustment complexity. In this study, we propose a method that combines structure entropy and k nearest neighbor to identify cell subpopulations in scRNA-seq data. In contrast to existing clustering methods for identifying cell subtypes, minimizing structure entropy yields natural communities without specifying the number of clusters. To investigate the performance of our model, we applied it to eight scRNA-seq datasets and compared our method with three existing methods (nonnegative matrix factorization, single-cell interpretation via multikernel learning, and structural entropy minimization principle). The experimental results showed that our approach achieves, on average, better performance on these datasets than the benchmark methods.
Keywords: clustering; k nearest neighbor; multikernel learning; single-cell RNA-seq; structure entropy; unsupervised learning
Year: 2019 PMID: 30700040 PMCID: PMC6409843 DOI: 10.3390/genes10020098
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The mechanism of the SSE (single-cell structure entropy minimization principle) algorithm. The input is a gene expression matrix. The SSE algorithm includes three steps: (1) the similarity is calculated by multikernel learning; (2) the cell network is constructed by KNN (k nearest neighbor); (3) clustering is implemented using the structure entropy minimization principle. Lastly, a gene priority ranking is produced as the output.
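Steps (2) and (3) of the caption can be illustrated with a minimal sketch. It assumes that a symmetric cell-by-cell similarity matrix is already available (in SSE this comes from multikernel learning, which is not reproduced here), and it assumes the two-dimensional structure entropy definition from the structural information literature, since this record does not state the formula. The function names `knn_graph` and `structure_entropy` are hypothetical. The sketch only scores a given partition; it does not perform the minimization itself.

```python
import math

def knn_graph(similarity, k):
    """Build an undirected, unweighted KNN edge set from a similarity matrix."""
    n = len(similarity)
    edges = set()
    for i in range(n):
        # Rank all other cells by decreasing similarity to cell i.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: similarity[i][j], reverse=True)
        for j in neighbours[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return edges

def structure_entropy(edges, labels):
    """Two-dimensional structure entropy of a graph under a partition.

    Assumed definition (not given in the source record):
    H = -sum_i (d_i / 2m) * log2(d_i / V_j) - sum_j (g_j / 2m) * log2(V_j / 2m),
    where d_i is a node degree, V_j a module volume, g_j the number of
    edges leaving module j, and m the number of edges.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    two_m = 2 * len(edges)
    volume, cut = {}, {}
    for node, d in degree.items():
        volume[labels[node]] = volume.get(labels[node], 0) + d
    for u, v in edges:
        if labels[u] != labels[v]:
            cut[labels[u]] = cut.get(labels[u], 0) + 1
            cut[labels[v]] = cut.get(labels[v], 0) + 1
    entropy = 0.0
    for node, d in degree.items():
        entropy -= d / two_m * math.log2(d / volume[labels[node]])
    for module, vol in volume.items():
        entropy -= cut.get(module, 0) / two_m * math.log2(vol / two_m)
    return entropy

# Toy example: six cells forming two tight groups of three.
sim = [[1.0 if i == j else (0.9 if i // 3 == j // 3 else 0.1)
        for j in range(6)] for i in range(6)]
edges = knn_graph(sim, k=2)
two_groups = structure_entropy(edges, [0, 0, 0, 1, 1, 1])
one_group = structure_entropy(edges, [0, 0, 0, 0, 0, 0])
```

On this toy input the partition into two groups scores lower than the single-group partition, which is the behaviour a minimization procedure would exploit to find communities without a preset cluster count.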
List of datasets and their attributes.
| GSE/ID | Dataset | Tissue | Number of Cells | Number of Genes | Number of Cell Populations | Reference |
|---|---|---|---|---|---|---|
| GSE57249 | Biase | Mouse embryo cell | 49 | 25,384 | 3 | Biase et al., 2014 |
| GSE36552 | Yan | Human embryo cell | 90 | 20,214 | 6 | Yan et al., 2013 |
| GSE45719 | Deng | Mouse embryo cell | 259 | 22,147 | 10 | Deng et al., 2014 |
| E-MTAB-2805 | Pollen | Human different tissues (stem cell) | 249 | 14,805 | 11 | Pollen et al., 2014 |
| GSE52583 | Treutlein | Mouse lung epithelial cell | 80 | 23,129 | 5 | Treutlein et al., 2014 |
| GSE57872 | Patel | Human glioblastoma cells | 430 | 5,948 | 5 | Patel et al., 2014 |
| GSE75688 | Chung | Human breast cancer and lymph node metastasis cells | 518 | 41,821 | 4 | Chung et al., 2017 |
| GSE38495 | Ramskold | Human cancer cell | 33 | 21,042 | 7 | Ramsköld et al., 2012 |
Cluster performance comparison of NMF (nonnegative matrix factorization), SIMLR (single-cell interpretation via multikernel learning), SE (structural entropy minimization principle), and SSE (single-cell structural entropy minimization principle) in terms of NMI (Normalized mutual information).
| Dataset | NMF | SIMLR | SE | SSE |
|---|---|---|---|---|
| Biase | 0.322 | 0.673 | 0.554 |  |
| Yan | 0.673 | 0.727 |  | 0.747 |
| Deng | 0.509 |  | 0.635 |  |
| Pollen | 0.944 |  | 0.781 |  |
| Treutlein | 0.277 | 0.276 |  | 0.270 |
| Patel | NA | 0.576 | NA |  |
| Chung | 0.196 | 0.283 | 0.322 |  |
| Ramskold |  | 0.818 | 0.596 | 0.772 |
| Average | 0.536 | 0.622 | 0.573 |  |
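For readers who want to reproduce this kind of score, the following is a minimal sketch of NMI between reference labels and predicted cluster labels. It assumes normalization by the arithmetic mean of the two label entropies; the record does not say which normalization the authors used, so values computed this way may differ from those in the table. The function name is hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(truth, predicted):
    """NMI of two labelings, normalized by the mean of their entropies."""
    n = len(truth)
    count_t = Counter(truth)
    count_p = Counter(predicted)
    joint = Counter(zip(truth, predicted))
    # Mutual information from the contingency counts.
    mutual = sum(c / n * math.log(c * n / (count_t[t] * count_p[p]))
                 for (t, p), c in joint.items())
    entropy_t = -sum(c / n * math.log(c / n) for c in count_t.values())
    entropy_p = -sum(c / n * math.log(c / n) for c in count_p.values())
    denominator = (entropy_t + entropy_p) / 2
    # Both labelings trivial (one cluster each): treat as perfect agreement.
    return mutual / denominator if denominator > 0 else 1.0

same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to how clusters are named, so a prediction that matches the reference up to relabeling reaches the maximum of 1.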
Cluster performance comparison of NMF, SIMLR, SE and SSE in terms of ARI (Adjusted Rand index).
| Dataset | NMF | SIMLR | SE | SSE |
|---|---|---|---|---|
| Biase | 0.244 | 0.682 | 0.682 |  |
| Yan | 0.519 | 0.487 | 0.477 |  |
| Deng | 0.312 | 0.364 | 0.388 |  |
| Pollen | 0.981 |  | 0.613 |  |
| Treutlein | 0.262 |  | 0.183 | 0.155 |
| Patel | NA | 0.527 | NA |  |
| Chung | 0.134 | 0.136 |  | 0.158 |
| Ramskold |  | 0.683 | 0.344 | 0.613 |
| Average | 0.448 | 0.506 | 0.412 |  |
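The ARI can be computed from pair counts in the contingency table of the two labelings. The sketch below uses the standard Hubert and Arabie form of the index and a hypothetical function name; it is an illustration of the metric, not the authors' evaluation code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, predicted):
    """ARI of two labelings, computed from pair counts."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of cells placed together in both labelings, in truth, in predicted.
    sum_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_t = sum(comb(c, 2) for c in Counter(truth).values())
    sum_p = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_t * sum_p / comb(n, 2)  # agreement expected by chance
    maximum = (sum_t + sum_p) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_joint - expected) / (maximum - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike NMI, the ARI is corrected for chance and can be negative when two labelings agree less often than random assignment would.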
Figure 2. Heat maps of the Biase dataset.
Figure 3. Heat maps of the Yan dataset.
Figure 4. Heat maps of the Deng dataset.
Figure 5. Heat maps of the Pollen dataset.
Figure 6. Heat maps of the Treutlein dataset.
Figure 7. Heat maps of the Patel dataset.
Figure 8. Heat maps of the Chung dataset.
Figure 9. Heat maps of the Ramskold dataset.
The number of clusters in the ‘gold standard’ and in the results of the four methods.
| Datasets | Gold Standard | NMF | SIMLR | SE | SSE |
|---|---|---|---|---|---|
| Biase | 3 | 3 | 3 | 5 | 3 |
| Yan | 6 | 6 | 6 | 11 | 7 |
| Deng | 10 | 10 | 10 | 8 | 13 |
| Pollen | 11 | 11 | 11 | 7 | 11 |
| Treutlein | 5 | 5 | 5 | 4 | 6 |
| Patel | 5 | 5 | 5 | NA | 15 |
| Chung | 4 | 4 | 4 | 11 | 21 |
| Ramskold | 7 | 7 | 7 | 3 | 5 |
Figure 10. Scatter plots of the eight datasets by principal component analysis (PCA). (a) Biase; (b) Yan; (c) Deng; (d) Pollen; (e) Treutlein; (f) Patel; (g) Chung; (h) Ramskold.