Guoyun Liu1, Manzhi Li1,2, Hongtao Wang1, Shijun Lin1, Junlin Xu3, Ruixi Li4, Min Tang5, Chun Li1.
Abstract
Clustering single-cell sequencing data is challenging because the data are high-dimensional and contain many noise points, and the traditional K-means algorithm is poorly suited to data of this type. This study therefore proposes a Dissimilarity-Density-Dynamic Radius-K-means clustering algorithm. The algorithm adds a dynamic radius parameter to the calculation and adjusts the active radius according to the characteristics of the data, which reduces the influence of noise points and improves the clustering results. The algorithm also computes a weight from the dissimilarity density of the data set, the average contrast of the candidate clusters, and the dissimilarity of the candidate clusters. From this weight it obtains a set of high-quality initial center points, which removes the randomness with which K-means otherwise selects its centers. Compared with similar algorithms, the proposed algorithm clusters single-cell data better: each of its clustering indexes is higher than those of the other single-cell clustering algorithms tested, and it overcomes the shortcomings of the traditional K-means algorithm.
Keywords: Dissimilarity matrix; K-means; ScRNA-seq; density; dynamic radius
Year: 2022 PMID: 35846121 PMCID: PMC9284269 DOI: 10.3389/fgene.2022.912711
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
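The idea in the abstract of choosing initial centers from dense, mutually dissimilar points can be illustrated with a minimal sketch. This is not the published algorithm: the function name `pick_initial_centers`, the `radius_scale` parameter, the rule that ties the radius to the mean pairwise dissimilarity, and the weight (density times distance to the nearest chosen center) are all assumptions made for illustration. The average-contrast term mentioned in the abstract is omitted.

```python
import math


def pick_initial_centers(points, k, radius_scale=0.3):
    """Return (indices of k initial centers, radius used).

    Illustrative only; not the procedure from the paper.
    """
    n = len(points)
    # Pairwise Euclidean dissimilarity matrix.
    dis = [[math.dist(p, q) for q in points] for p in points]
    # Assumption: the radius adapts to the data as a fraction of the
    # mean pairwise dissimilarity.
    radius = radius_scale * sum(map(sum, dis)) / (n * (n - 1))
    # Density of a point = number of other points within the radius.
    density = [sum(d <= radius for d in row) - 1 for row in dis]

    # First center: the densest point.
    centers = [max(range(n), key=density.__getitem__)]
    while len(centers) < k:
        # Weight favours dense points far from every chosen center;
        # an isolated noise point has density 0 and therefore weight 0.
        def weight(i):
            return density[i] * min(dis[i][c] for c in centers)

        centers.append(max(range(n), key=weight))
    return centers, radius
```

The returned indices could seed any standard K-means implementation in place of random initialization. Because the weight multiplies density by separation, a point that is far from everything but has no neighbors within the radius is not preferred over points inside real clusters.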
FIGURE 1. Dissimilarity density.
FIGURE 2. Dissimilarity of candidate clusters.
FIGURE 3. Algorithm block diagram.
FIGURE 4. Clustering at a fixed radius. (A) The radius is too small; (B) the radius is too large; (C) the radius is appropriate.
Summary of the nine scRNA-seq data sets used in this study.
| Data set | The number of cells | The number of genes | The number of clusters |
|---|---|---|---|
| Kolod | 704 | 10685 | 3 |
| Pollen | 249 | 14805 | 11 |
| Ting | 114 | 11405 | 5 |
| Loh | 429 | 18087 | 8 |
| Goolam | 124 | 16384 | 5 |
| Usoskin | 622 | 17772 | 4 |
| Xin | 1600 | 39851 | 8 |
| Zeisel | 3005 | 4412 | 48 |
| Macosko | 6418 | 12822 | 39 |
Clustering indexes after dimensionality reduction.
| Input data | Index | Kolod | Pollen | Usoskin | Ting | Loh | Goolam | Xin | Zeisel | Macosko |
|---|---|---|---|---|---|---|---|---|---|---|
| Original Data | NMI | 0.5202 | 0.8533 | 0.3139 | 0.7262 | 0.5512 | 0.6218 | 0.5338 | 0.5262 | 0.4772 |
| | FM | 0.8207 | 0.7837 | 0.5923 | 0.534 | 0.6013 | 0.7605 | 0.5468 | 0.3260 | 0.3726 |
| | Accuracy | 0.6960 | 0.7807 | 0.5907 | 0.7746 | 0.5734 | 0.8097 | 0.8744 | 0.4985 | 0.4399 |
| | RandIndex | 0.7080 | 0.9323 | 0.7011 | 0.8370 | 0.7924 | 0.8140 | 0.6971 | 0.9230 | 0.9092 |
| t-SNE | NMI | | | | | | | | | |
| | FM | | | | | | | | | |
| | Accuracy | | | | | | | | | |
| | RandIndex | | | | | | | | | |
| PCA | NMI | 0.5557 | 0.8190 | 0.3435 | 0.8318 | 0.6398 | 0.6674 | 0.5821 | 0.4031 | 0.3433 |
| | FM | 0.7710 | 0.8013 | 0.5486 | 0.9077 | 0.6616 | 0.7779 | 0.5873 | 0.2254 | 0.2456 |
| | Accuracy | 0.7685 | 0.8233 | 0.5723 | 0.8947 | 0.6727 | 0.8653 | 0.9175 | 0.4254 | 0.3398 |
| | RandIndex | 0.7905 | 0.9475 | 0.6837 | 0.9127 | 0.8648 | 0.8679 | 0.7119 | 0.9188 | 0.9185 |
| MDS | NMI | 0.5519 | 0.8123 | 0.3438 | 0.8228 | 0.6444 | 0.7202 | 0.5960 | 0.4033 | 0.3441 |
| | FM | 0.7679 | 0.5588 | 0.5588 | 0.8429 | 0.6674 | 0.7927 | 0.6046 | 0.2255 | 0.2420 |
| | Accuracy | 0.7648 | 0.5723 | 0.5723 | 0.8596 | 0.6681 | 0.8871 | 0.9219 | 0.4252 | 0.3363 |
| | RandIndex | 0.7883 | 0.6845 | 0.6845 | 0.9067 | 0.8647 | 0.9078 | 0.7288 | 0.9189 | 0.9163 |
| Isomap | NMI | 0.4574 | 0.7350 | 0.3686 | 0.9173 | 0.7812 | 0.6535 | 0.6002 | 0.5338 | 0.5063 |
| | FM | 0.7797 | 0.6632 | 0.6709 | 0.8064 | 0.8292 | 0.7295 | 0.5852 | 0.3307 | 0.4182 |
| | Accuracy | 0.7741 | 0.6908 | 0.6672 | 0.8684 | 0.8436 | 0.8734 | 0.9207 | 0.5196 | 0.4634 |
| | RandIndex | 0.7590 | 0.9070 | 0.7372 | 0.9104 | 0.9355 | 0.8173 | 0.7240 | 0.9251 | 0.9133 |
| LLE | NMI | 0.5358 | 0.8941 | 0.4951 | 0.8172 | 0.7867 | 0.7205 | 0.5831 | 0.5719 | 0.6020 |
| | FM | 0.8006 | 0.8931 | 0.7353 | 0.8458 | 0.7843 | 0.3620 | 0.6042 | 0.3620 | 0.5398 |
| | Accuracy | 0.7955 | 0.9076 | 0.7267 | 0.8772 | 0.8462 | 0.5237 | 0.8882 | 0.5237 | 0.5734 |
| | RandIndex | 0.7897 | 0.9695 | 0.7841 | 0.8763 | 0.9225 | 0.8978 | 0.7240 | 0.8978 | 0.9405 |
| LPP | NMI | 0.7105 | 0.8875 | 0.6887 | 0.7869 | 0.7709 | 0.7056 | 0.5543 | 0.4819 | 0.4517 |
| | FM | 0.7977 | 0.8460 | 0.8559 | 0.8351 | 0.7449 | 0.7991 | 0.5506 | 0.2664 | 0.3275 |
| | Accuracy | 0.7979 | 0.8594 | 0.8376 | 0.8509 | 0.8089 | 0.8790 | 0.9006 | 0.4516 | 0.4020 |
| | RandIndex | 0.7925 | 0.9620 | 0.8680 | 0.8572 | 0.8693 | 0.8996 | 0.7036 | 0.9197 | 0.9232 |
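For readers who want to see how indexes of the kind reported in the table above are computed, here is a small standard-library sketch. It rests on assumptions the page does not confirm: "FM" is taken to be the Fowlkes-Mallows index, NMI is normalized by the arithmetic mean of the two entropies, and Accuracy is scored under the best one-to-one matching of predicted to true labels (found by brute force, so only practical for a small number of clusters). The function name `clustering_indexes` is hypothetical, and the exact definitions used in the paper may differ.

```python
import math
from collections import Counter
from itertools import permutations


def _pairs(c):
    """Number of unordered pairs among c items."""
    return c * (c - 1) // 2


def clustering_indexes(truth, pred):
    """Return NMI, FM, Accuracy and RandIndex for two labelings."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    t_sizes, p_sizes = Counter(truth), Counter(pred)

    # Pair counts: together in both labelings, in truth, in pred.
    both = sum(_pairs(c) for c in joint.values())
    same_t = sum(_pairs(c) for c in t_sizes.values())
    same_p = sum(_pairs(c) for c in p_sizes.values())
    total = _pairs(n)
    # Rand index = (agreeing pairs) / (all pairs).
    rand = (total + 2 * both - same_t - same_p) / total
    fm = both / math.sqrt(same_t * same_p) if same_t and same_p else 0.0

    # Mutual information, normalized by the mean entropy (assumption).
    def entropy(sizes):
        return -sum(c / n * math.log(c / n) for c in sizes.values())

    mi = sum(c / n * math.log(c * n / (t_sizes[t] * p_sizes[p]))
             for (t, p), c in joint.items())
    h = (entropy(t_sizes) + entropy(p_sizes)) / 2
    nmi = mi / h if h > 0 else 1.0

    # Accuracy under the best one-to-one relabeling (brute force).
    t_labels, p_labels = list(t_sizes), list(p_sizes)
    # Pad with None so surplus predicted clusters can stay unmatched.
    t_labels += [None] * max(0, len(p_labels) - len(t_labels))
    best = max(sum(joint[(t, p)] for t, p in zip(perm, p_labels))
               for perm in permutations(t_labels, len(p_labels)))

    return {"NMI": nmi, "FM": fm, "Accuracy": best / n, "RandIndex": rand}
```

All four indexes are invariant to how the cluster labels are named, which is why a perfect clustering with permuted labels still scores 1.0 on each. For larger numbers of clusters, the brute-force matching would normally be replaced by an assignment solver.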
FIGURE 5. Clustering index values for clustering at different dimensions.
FIGURE 6. Indexes of the D3K algorithm in single-cell data clustering.
FIGURE 7. D3K algorithm visualization analysis.
FIGURE 8. Clustering results with and without T set.
FIGURE 9. Deng data set gene marker results.