Cheng-Han Chua, Meihui Guo, Shih-Feng Huang.
Abstract
This paper proposes a KC Score to measure feature importance in clustering analysis of high-dimensional data. The KC Score evaluates the contribution of each feature based on the correlation between the original features and the reconstructed features in the low-dimensional latent space. A KC Score-based feature selection strategy is further developed for clustering analysis. We investigate the performance of the proposed strategy on four single-cell RNA sequencing (scRNA-seq) datasets. The results show that the strategy effectively selects important features for clustering. In particular, in three datasets, it selected fewer than 5% of the features and achieved the same or better clustering performance than using all of the features.
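The idea of scoring a feature by how well a low-dimensional latent space reproduces it can be illustrated with a small sketch. This is not the paper's KC Score, whose definition is not given in this record: as a stand-in assumption, the latent space here is the one-dimensional space of the first principal component, and the helper names (`pearson`, `leading_direction`, `correlation_scores`, `toy_data`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def leading_direction(Xc, iters=100, seed=0):
    """First principal direction of the centered rows Xc, found by power iteration."""
    n, p = len(Xc), len(Xc[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        z = [sum(row[j] * v[j] for j in range(p)) for row in Xc]        # X v
        w = [sum(z[i] * Xc[i][j] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


def correlation_scores(X):
    """Score each feature by |corr(original feature, rank-1 reconstruction)|.

    Assumption: a 1-D PCA projection stands in for the latent space.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = leading_direction(Xc)
    z = [sum(row[j] * v[j] for j in range(p)) for row in Xc]  # latent coordinates
    scores = []
    for j in range(p):
        original = [row[j] for row in Xc]
        reconstructed = [z[i] * v[j] for i in range(n)]
        scores.append(abs(pearson(original, reconstructed)))
    return scores


def toy_data(n=100, seed=0):
    """Two groups separated on features 0 and 1; feature 2 is pure noise."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        g = i % 2
        rows.append([5 * g + rng.gauss(0, 0.5),
                     -4 * g + rng.gauss(0, 0.5),
                     rng.gauss(0, 0.5)])
    return rows


if __name__ == "__main__":
    print(correlation_scores(toy_data()))
```

On the toy data, the two features that carry the group structure receive scores close to 1, while the noise feature scores much lower, so ranking by score and keeping the top features would retain the informative ones.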
Year: 2022 PMID: 35798813 PMCID: PMC9263137 DOI: 10.1038/s41598-022-15529-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Table 1. Description of four scRNA-seq datasets.
| Data name | Number of cells | Number of genes | Class # | Description |
|---|---|---|---|---|
| mECS | 182 | 8989 | 3 | Embryonic stem cells under different cell cycle stages |
| Kolod | 704 | 13,473 | 3 | Pluripotent cells under different environment conditions |
| Pollen | 249 | 6982 | 11 | Eleven cell populations including neural cells and blood cells |
| Usoskin | 622 | 17,772 | 4 | Neuronal cells with sensory subtypes |
Figure 1. Flow chart of Strategy A.
Table 2. Feature selection results, NNA, RFA and NMI for the four data sets.
| Data set | Latent space projection | Selected features | | Selected features (%) | NNA | RFA | NMI |
|---|---|---|---|---|---|---|---|
| mECS | | 2860 | 50.49 | 31.82 | 0.97 | 0.96 | 0.84 |
| | | 5595 | 69.89 | 62.24 | 0.95 | 0.96 | 0.85 |
| | | | | | 0.95 | 0.95 | 0.89 |
| Kolod | | 2 | 8 | 0.01 | 1.00 | 1.00 | 1.00 |
| | | 10 | 28.57 | 0.07 | 1.00 | 1.00 | 1.00 |
| | | | | | 1.00 | 1.00 | 0.99 |
| Pollen | | 225 | 100 | 3.22 | 0.98 | 0.98 | 0.94 |
| | | 115 | 100 | 1.65 | 0.98 | 0.98 | 0.91 |
| | | | | | 0.98 | 0.95 | 0.95 |
| Usoskin | | 65 | 41.94 | 0.37 | 0.99 | 0.99 | 0.96 |
| | | 55 | 73.33 | 0.31 | 0.98 | 0.98 | 0.93 |
| | | | | | 0.94 | 0.96 | 0.74 |
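
The third numeric value in each row of the table above appears to be the share of all features that was selected. The short check below works under two assumptions that this record does not state explicitly: that the first numeric value in a row is the number of selected genes, and that the totals are the gene counts listed in the dataset description table (8989, 13,473, 6982 and 17,772). The function name `selected_percentage` is hypothetical.

```python
# Assumed totals: gene counts taken from the dataset description table.
TOTAL_GENES = {"mECS": 8989, "Kolod": 13473, "Pollen": 6982, "Usoskin": 17772}

# Assumed selected-gene counts: first numeric value of each populated row above.
SELECTED = {
    "mECS": [2860, 5595],
    "Kolod": [2, 10],
    "Pollen": [225, 115],
    "Usoskin": [65, 55],
}


def selected_percentage(n_selected, n_total):
    """Percentage of features kept, rounded to two decimals."""
    return round(100.0 * n_selected / n_total, 2)


if __name__ == "__main__":
    for name, counts in SELECTED.items():
        for k in counts:
            print(name, k, selected_percentage(k, TOTAL_GENES[name]))
```

Under these assumptions the computed percentages match the tabulated ones (for example, 2860 of 8989 genes is 31.82%), which is consistent with the abstract's statement that fewer than 5% of the features were selected in three of the four datasets.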
Figure 2. The NMIs based on the two gene selections made respectively by the KC and Laplacian Scores in Table 2.
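For readers unfamiliar with NMI (normalized mutual information), the sketch below computes it between a set of true class labels and a set of cluster labels. Several normalizations are in use, and this record does not say which one the paper adopts; the version here divides by the arithmetic mean of the two entropies, which is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in count_ab.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings put every item in one group
    return 2.0 * mutual_information(a, b) / (h_a + h_b)
```

NMI does not depend on how the clusters are numbered: a clustering that reproduces the true classes under different label names still scores 1, while a clustering that is independent of the classes scores 0.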
Figure 3. The NNA and NMI curves of Strategy A against different numbers (log scale) of genes based on the KC Score (single Gaussian kernel; SIMLR) versus the Laplacian Score (single Gaussian kernel; SIMLR) for the four scRNA-seq datasets, where the circles in each subplot denote the locations of the selected numbers of genes.
Figure 4. Scatter plot of the first two (the 9708th and 11221st) critical cell gene expressions for the Kolod dataset.
Figure 5. Three 2-dimensional latent spaces, shown in panels (a), (b), and (c), obtained by t-SNE for the Usoskin dataset.
Table 3. Running time (in hours) spent in steps 1 and 2 of Strategy A when applying different methods to the four datasets.
| Dataset | Number of cells | KC Score (single Gaussian kernel) | KC Score (SIMLR) | Laplacian Score (single Gaussian kernel) | Laplacian Score (SIMLR) |
|---|---|---|---|---|---|
| mECS | 182 | 0.09 | 5.90 | 0.09 | 5.93 |
| Kolod | 704 | 2.76 | 64.17 | 2.70 | 68.09 |
| Pollen | 249 | 0.12 | 7.21 | 0.12 | 7.33 |
| Usoskin | 622 | 3.39 | 76.02 | 3.46 | 77.43 |
Figure 6. The MANOVA Pillai's Trace statistic curves of Strategy A against different numbers (log scale) of genes based on the KC Score (single Gaussian kernel; SIMLR) versus the Laplacian Score (single Gaussian kernel; SIMLR) for the four scRNA-seq datasets, where the circles denote the locations of the selected numbers of genes for each method.
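As a reminder of what the statistic in this figure measures, the standard textbook definition of Pillai's trace is given below. The notation ($H$, $E$, $g$, $p$, $n_k$, $\lambda_i$) is introduced here for illustration and is not taken from the paper; this record also does not state which variables enter the MANOVA.

$$
V = \operatorname{tr}\!\left[ H (H + E)^{-1} \right] = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i},
\qquad s = \min(p,\, g - 1),
$$

$$
H = \sum_{k=1}^{g} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{\top},
\qquad
E = \sum_{k=1}^{g} \sum_{i \in \text{group } k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top}.
$$

Here $H$ is the between-group and $E$ the within-group sum-of-squares-and-cross-products matrix for $g$ groups and $p$ variables, $n_k$ is the size of group $k$, and $\lambda_i$ are the nonzero eigenvalues of $E^{-1}H$. The statistic lies between 0 and $s$, with larger values indicating stronger separation between groups.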
Figure 7. The confusion table of the clustering result, where the clusters are rearranged to achieve the highest accuracy.
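Rearranging clusters for the highest accuracy amounts to choosing the one-to-one matching between clusters and true classes that maximizes the diagonal of the confusion table. The sketch below does this by brute force over all permutations. It assumes the number of clusters equals the number of classes, and the function names are hypothetical; how the paper performs the rearrangement is not described in this record.

```python
from itertools import permutations


def confusion_table(true_labels, cluster_labels):
    """Rows are true classes, columns are clusters (both in sorted label order)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    table = [[0] * len(clusters) for _ in classes]
    for t, c in zip(true_labels, cluster_labels):
        table[classes.index(t)][clusters.index(c)] += 1
    return classes, clusters, table


def best_rearrangement(table):
    """Reorder columns so the diagonal sum (number of matched items) is maximal.

    Assumes a square table. Brute force is fine for a handful of classes only.
    """
    k = len(table)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(table[i][perm[i]] for i in range(k)),
    )
    rearranged = [[row[j] for j in best] for row in table]
    total = sum(sum(row) for row in table)
    accuracy = sum(rearranged[i][i] for i in range(k)) / total
    return best, rearranged, accuracy


if __name__ == "__main__":
    truth = ["a", "a", "a", "b", "b", "c", "c", "c"]
    found = [2, 2, 1, 0, 0, 1, 1, 1]
    _, _, table = confusion_table(truth, found)
    print(best_rearrangement(table))
```

The number of permutations grows factorially, so for a dataset with many classes (Pollen has eleven) an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the more practical choice.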