Peng Zhao1, Zenglin Xu2,3, Junjie Chen2, Yazhou Ren1,4, Irwin King5.
Abstract
Single cell RNA sequencing (scRNA-seq) provides transcriptome profiles for individual cells, allowing researchers to explore tissue heterogeneity, distinguish unusual cell identities, and find novel cellular subtypes. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the performance of existing single-cell clustering methods is extremely sensitive to noisy data and outliers, and existing clustering algorithms can easily fall into local optima. There is still no consensus on the best performing method. To address these issues, we introduce a single cell self-paced clustering (scSPaC) method with F-norm based nonnegative matrix factorization (NMF) for scRNA-seq data and a sparse single cell self-paced clustering (sscSPaC) method with l21-norm based nonnegative matrix factorization for scRNA-seq data. We gradually add single cells to our model, from simple to complex, until all cells are selected. In this way, the influence of noisy data and outliers can be significantly reduced. The proposed method achieved the best performance on both simulation data and real scRNA-seq data. A case study on clustering scRNA-seq data from human Clara cells and ependymal cells shows that scSPaC is more advantageous near the clustering dividing line.
Keywords: clustering; nonnegative matrix factorization; scRNA-seq; self-paced learning; sequencing data
Year: 2022 PMID: 35409258 PMCID: PMC8999118 DOI: 10.3390/ijms23073900
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Workflow for single cell self-paced clustering (scSPaC) and sparse single cell self-paced clustering (sscSPaC), which includes data preprocessing, clustering, and visualization. The pentagram in the figure represents the cluster center. The number of clusters is searched within a reasonable range (determined by an existing tool, SCANPY), and we discuss the impact of the cluster number on model performance in Section 3.3.
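The self-paced idea described in the abstract (fit on the easiest cells first, then admit harder ones until all cells are selected) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hard 0/1 self-paced weight per cell, a quantile-based schedule for the pace threshold, standard multiplicative updates for F-norm NMF, and a genes × cells input matrix. The function name `self_paced_nmf` and all parameters are hypothetical.

```python
import numpy as np

def self_paced_nmf(X, k, n_rounds=5, n_inner=50, start_frac=0.5, seed=0):
    """Sketch of F-norm NMF with hard self-paced cell selection.

    X is a nonnegative genes x cells matrix. Returns W (genes x k),
    H (k x cells), a cluster label per cell, and the final 0/1 weights.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + 1e-3
    H = rng.random((k, n_cells)) + 1e-3
    eps = 1e-10
    v = np.ones(n_cells)
    # Assumed pace schedule: admit a growing fraction of cells each round.
    for frac in np.linspace(start_frac, 1.0, n_rounds):
        loss = ((X - W @ H) ** 2).sum(axis=0)   # reconstruction loss per cell
        lam = np.quantile(loss, frac)           # pace threshold for this round
        v = (loss <= lam).astype(float)         # 1 = "easy" cell, 0 = held out
        for _ in range(n_inner):
            # Each column of H is fit independently, so no weights are needed.
            H *= (W.T @ X) / (W.T @ W @ H + eps)
            # Only the selected cells contribute to the basis update.
            Xv, Hv = X * v, H * v
            W *= (Xv @ H.T) / (W @ (Hv @ H.T) + eps)
    labels = H.argmax(axis=0)                   # assign each cell to its dominant factor
    return W, H, labels, v
```

In the last round the threshold equals the largest per-cell loss, so every cell is selected, which mirrors the "until all cells are selected" stopping rule. A sparse variant would replace the squared F-norm loss with an l21-norm loss; that is not shown here.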
A summary of the scRNA-seq datasets used in this study.
| Datasets | # Clusters | # Cells | # Genes | Cluster Size | Reference |
|---|---|---|---|---|---|
| simulated data | 2 | 200 | 22002 | | Splatter |
| baron | 14 | 1937 | 20125 | | GSE84133 |
| kolodziejczyk | 3 | 704 | 32316 | | E-MTAB-2600 |
| pollen | 11 | 301 | 20367 | | SRP041736 |
| rca | 7 | 561 | 20949 | | GSE81861 |
| goolam | 5 | 124 | 26670 | | E-MTAB-3321 |
| zeisel | 9 | 3005 | 13845 | | GSE60361 |
| cell lines | 4 | 1047 | 18666 | | GSE126074 |
Evaluation of clustering performance on simulated data. The highest score for each dataset is shown in bold and the second best score is underlined. Values are reported as mean ± std.
| Methods | ARI | Purity | NMI |
|---|---|---|---|
| K-means | 0.45 ± 0.93 | 52.45 ± 2.96 | 0.89 ± 1.07 |
| NMF | 9.92 ± 9.72 | 64.03 ± 7.78 | 8.20 ± 7.30 |
| ONMF | 0.47 ± 1.01 | 52.50 ± 3.00 | 1.00 ± 1.27 |
| l21-norm NMF | 0.64 ± 0.93 | 53.78 ± 2.74 | 1.29 ± 1.18 |
| Seurat | 0.00 ± 0.00 | 54.83 ± 0.06 | 0.10 ± 0.01 |
| Scanpy | 0.20 ± 0.00 | 57.52 ± 0.08 | 3.67 ± 0.13 |
| SC3 | 10.79 ± 0.95 | 63.68 ± 5.72 | 9.26 ± 1.09 |
| scSPaC | | | |
| sscSPaC | | | |
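For reference, the three scores reported in these tables (ARI, purity, NMI) can be computed from a pair of label vectors as in the sketch below. It uses only the Python standard library. The NMI normalization by the arithmetic mean of the two entropies is an assumption, since the record does not state which variant was used, and the function names are illustrative.

```python
from collections import Counter
from math import comb, log

def purity(truth, pred):
    """Fraction of cells that belong to the majority true class of their cluster."""
    best = {}
    for (cluster, _label), n in Counter(zip(pred, truth)).items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(truth)

def ari(truth, pred):
    """Adjusted Rand index computed from pair counts (needs at least 2 cells)."""
    n = len(truth)
    together = sum(comb(m, 2) for m in Counter(zip(truth, pred)).values())
    a = sum(comb(m, 2) for m in Counter(truth).values())
    b = sum(comb(m, 2) for m in Counter(pred).values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    if max_index == expected:   # degenerate case: both partitions are trivial
        return 1.0
    return (together - expected) / (max_index - expected)

def nmi(truth, pred):
    """Mutual information normalized by the mean entropy (assumed variant)."""
    n = len(truth)
    joint, pt, pp = Counter(zip(truth, pred)), Counter(truth), Counter(pred)
    mi = sum(m / n * log(m * n / (pt[t] * pp[c])) for (t, c), m in joint.items())
    h_true = -sum(m / n * log(m / n) for m in pt.values())
    h_pred = -sum(m / n * log(m / n) for m in pp.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0
```

All three functions return values on a 0 to 1 scale (ARI can be negative) and are invariant to how the cluster labels are named.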
Clustering results for ARI on real scRNA-seq data. The highest score for each dataset is shown in bold and the second best score is underlined. scSPaC and sscSPaC are based on F-norm and l21-norm NMF, respectively, with a self-paced learning single cell selection strategy.
| Datasets | Baron | Goolam | Kolodziejczyk | Pollen | Rca | Zeisel | Cell Line |
|---|---|---|---|---|---|---|---|
| K-means | 35.96 ± 4.44 | 15.73 ± 3.83 | 28.56 ± 15.33 | 62.55 ± 10.25 | 3.00 ± 0.22 | 10.12 ± 3.02 | 81.46 ± 4.36 |
| NMF | 49.73 ± 9.03 | 13.26 ± 6.60 | 37.38 ± 6.56 | 79.39 ± 4.88 | 11.33 ± 0.61 | 24.21 ± 2.96 | 79.85 ± 1.98 |
| ONMF | 50.03 ± 11.03 | 22.16 ± 4.48 | 40.73 ± 3.26 | 77.50 ± 4.51 | 6.83 ± 0.23 | 24.54 ± 4.89 | 80.29 ± 3.75 |
| l21-norm NMF | 43.21 ± 4.16 | 33.61 ± 5.34 | 39.48 ± 2.23 | 76.66 ± 4.92 | 7.70 ± 0.98 | 35.83 ± 4.17 | 82.53 ± 4.26 |
| Seurat | 61.82 ± 0.18 | 47.63 ± 0.08 | | 81.82 ± 0.12 | 52.41 ± 0.08 | 52.73 ± 0.82 | 69.73 ± 0.12 |
| Scanpy | 74.91 ± 0.24 | 54.25 ± 0.16 | 45.37 ± 1.22 | 84.91 ± 0.10 | 54.5 ± 0.16 | 48.46 ± 0.92 | 82.61 ± 0.10 |
| SC3 |
| 57.52 ± 2.38 | 47.57 ± 3.64 | 49.78 ± 2.88 | 88.36 ± 5.14 | ||
| scSPaC |
| 48.90 ± 2.55 | 88.16 ± 3.73 | 57.02 ± 1.75 |
| ||
| sscSPaC | 78.84 ± 2.70 | | | | | | |
Figure 2. ARI for all test datasets in this study. Bar: average ARI; error bar: standard deviation of ARI values over 20 runs.
Clustering results for purity on real scRNA-seq data. The highest score for each dataset is shown in bold and the second best score is underlined.
| Datasets | Baron | Goolam | Kolodziejczyk | Pollen | Rca | Zeisel | Cell Line |
|---|---|---|---|---|---|---|---|
| K-means | 71.95 ± 2.18 | 57.66 ± 2.82 | 62.66 ± 9.25 | 77.54 ± 7.79 | 30.42 ± 0.19 | 49.57 ± 2.75 | 86.43 ± 0.12 |
| NMF | 82.56 ± 2.85 | 59.23 ± 2.79 | 68.27 ± 3.31 | 90.02 ± 3.18 | 31.37 ± 0.54 | 60.69 ± 2.46 | 81.74 ± 0.1 |
| ONMF | 80.92 ± 4.01 | 59.23 ± 1.95 | 69.49 ± 1.4 | 88.34 ± 3.8 | 31.01 ± 0.4 | 58.34 ± 2.26 | 82.18 ± 0.01 |
| l21-norm NMF | 92.35 ± 1.47 | 70.85 ± 4.16 | 69.22 ± 0.94 | 91.01 ± 1.74 | 32.07 ± 1.13 | 66.3 ± 2.64 | 87.81 ± 0.1 |
| Seurat | 86.15 ± 0.26 | 72.18 ± 0.04 | | 86.15 ± 0.17 | 72.91 ± 0.01 | 51.99 ± 0.02 | 79.52 ± 0.04 |
| Scanpy | 87.89 ± 0.06 | 75.63 ± 0.64 | 76.44 ± 0.1 | 93.69 ± 0.06 | 78.59 ± 0.64 | 50.68 ± 0.1 | 88.41 ± 0.03 |
| SC3 | 90.72 ± 2.28 | 76.59 ± 2.76 | 78.13 ± 3.51 | 94.95 ± 2.76 | 78.14 ± 3.01 | 92.75 ± 0.09 | |
| scSPaC |
| 79.03 ± 3.48 | 83.22 ± 2.07 | ||||
| sscSPaC | | | | | | | |
Clustering results for NMI on real scRNA-seq data. The highest score for each dataset is shown in bold and the second best score is underlined.
| Datasets | Baron | Goolam | Kolodziejczyk | Pollen | Rca | Zeisel | Cell Line |
|---|---|---|---|---|---|---|---|
| K-means | 42.77 ± 3.74 | 20.2 ± 5.43 | 32.85 ± 16.3 | 80.57 ± 6.05 | 1.39 ± 0.19 | 19.15 ± 3.56 | 79.47 ± 2.39 |
| NMF | 62.11 ± 4.29 | 17.34 ± 6.42 | 42.43 ± 5.59 | 91.09 ± 2.4 | 2.62 ± 0.72 | 35.53 ± 2.22 | 80.81 ± 3.73 |
| ONMF | 60.77 ± 4.89 | 16.07 ± 3.87 | 44.33 ± 2.65 | 89.94 ± 2.84 | 2.15 ± 0.5 | 33.48 ± 2.86 | 80.45 ± 2.11 |
| l21-norm NMF | 64.75 ± 1.93 | 51.95 ± 4.02 | 44.15 ± 1.78 | | 5.98 ± 1.38 | 38.76 ± 2.34 | 84.86 ± 2.91 |
| Seurat | 61.57 ± 0.23 | 43.23 ± 0.07 | 51.54 ± 0.02 | 86.11 ± 0.07 | 38.92 ± 0.04 | 52.03 ± 0.02 | 63.62 ± 0.07 |
| Scanpy | 73.98 ± 0.22 | 54.9 ± 0.07 | 49.56 ± 0.03 | 89.33 ± 0.12 | 36.02 ± 0.03 | 44.25 ± 0.03 | 80.46 ± 0.12 |
| SC3 | | 56.59 ± 3.13 | 52.67 ± 6.64 | 91.25 ± 3.4 | | 50.01 ± 4.28 | 82.75 ± 3.14 |
| scSPaC | 79.82 ± 3.48 | | | 89.09 ± 1.86 | 51.70 ± 0.41 | | |
| sscSPaC | | | | | | | |
Figure 3. Clustering performance (ARI) for different numbers of highly variable genes (HVGs). Each line shows the ARI of one dataset when 200–2500 highly variable genes are used.
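As a rough illustration of what varying the number of HVGs involves, the sketch below ranks genes by the dispersion (variance divided by mean) of library-size-normalized, log-transformed counts and keeps the top `n_top`. This is a simplified stand-in, not the selection procedure used in the study, which is not specified in this record. The function name, the 1e4 scaling factor, and the genes × cells orientation are assumptions.

```python
import numpy as np

def top_variable_genes(counts, n_top):
    """Return sorted row indices of the n_top most dispersed genes.

    counts is a nonnegative genes x cells matrix of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=0, keepdims=True)              # total counts per cell
    norm = np.log1p(counts / np.maximum(lib, 1) * 1e4)   # normalize, then log1p
    mean = norm.mean(axis=1)
    var = norm.var(axis=1)
    dispersion = var / np.maximum(mean, 1e-12)           # guard against zero mean
    order = np.argsort(dispersion)[::-1]                 # most dispersed first
    return np.sort(order[:n_top])
```

Calling this with `n_top` values from 200 to 2500 and re-running the clustering on the selected rows would produce one ARI curve per dataset, as in the figure.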
Changes in ARI values for different cluster numbers K on the simulated data and the 7 real scRNA-seq datasets. “Ref. K” means reference K, the number of provided single cell types. “Estimated K” is the cluster number estimated by Scanpy, and the columns K − 3 to K + 3 give the ARI for cluster numbers around it. “–” means the number of clusters is less than 2. Bold numbers indicate the best performance (ARI) of each dataset across the different K.
| Datasets | Ref. K | Estimated K | Best K | K − 3 | K − 2 | K − 1 | K | K + 1 | K + 2 | K + 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| simulated data | 2 | 2 | 2 | – | – | – | | 0.2448 | 0.2567 | 0.2489 |
| baron | 14 | 13 | 11 | 0.7808 | | 0.8319 | 0.8094 | 0.7862 | 0.8249 | 0.7727 |
| goolam | 5 | 5 | 5 | 0.4227 | 0.4518 | 0.4615 | | 0.5758 | 0.5661 | 0.5732 |
| Kolodziejczyk | 3 | 8 | 5 | | 0.4863 | 0.4875 | 0.4671 | 0.4679 | 0.4628 | 0.4605 |
| pollen | 11 | 8 | 10 | | 0.7172 | 0.7893 | 0.8764 | 0.8753 | | 0.8612 |
| Rca | 7 | 9 | 8 | 0.5475 | 0.5419 | | 0.5671 | 0.5623 | 0.5453 | 0.5286 |
| zeisel | 9 | 13 | 10 | | 0.6246 | 0.6241 | 0.6078 | 0.5793 | 0.5641 | 0.5632 |
| cell line | 4 | 4 | 4 | – | 0.5468 | 0.7025 | | 0.9043 | 0.9102 | 0.8954 |
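The search over cluster numbers summarized in this table can be sketched as a small window sweep: take the Scanpy estimate K, evaluate every K from K − 3 to K + 3 that is at least 2, and keep the value with the highest score. The helper names and the `score_fn` callback are hypothetical. In practice `score_fn` would run the clustering for a given K and return its ARI against the reference labels.

```python
def candidate_ks(k_est, width=3, k_min=2):
    """Cluster numbers from k_est - width to k_est + width; None marks K < k_min."""
    return [k if k >= k_min else None
            for k in range(k_est - width, k_est + width + 1)]

def best_k(k_est, score_fn, width=3):
    """Evaluate score_fn on every valid candidate and return (best K, all scores)."""
    scores = {k: score_fn(k) for k in candidate_ks(k_est, width) if k is not None}
    return max(scores, key=scores.get), scores
```

For an estimate of 2, the first three candidates are invalid, which corresponds to the three “–” entries in the simulated data row. For an estimate of 4, only K − 3 is invalid, as in the cell line row.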
Figure 4. t-SNE visualization of clustering results for human pulmonary alveolar type II, Clara, and ependymal cells from scRNA-seq data. The red filled circles represent Clara cells and the blue filled triangles represent ependymal cells. (a) t-SNE for K-means; (b) t-SNE for the original NMF; (c) t-SNE for single cell self-paced clustering (scSPaC); (d) t-SNE for the ground truth.