Geert-Jan Huizing, Gabriel Peyré, Laura Cantini.
Abstract
MOTIVATION: High-throughput single-cell molecular profiling is revolutionizing biology and medicine by unveiling the diversity of cell types and states contributing to development and disease. The identification and characterization of cellular heterogeneity is typically achieved through unsupervised clustering, which crucially relies on a similarity metric.
Year: 2022 PMID: 35157031 PMCID: PMC9004651 DOI: 10.1093/bioinformatics/btac084
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Summary of the datasets used for our benchmark and their clustering results
| Dataset name | Data description | | | | | Clustering results | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Data type | Reference | Number of cells | Number of features after preprocessing | Number of clusters in ground truth | Detected number of clusters for hierarchical clustering (OT) | Detected number of clusters for hierarchical clustering (Pearson) | Detected number of clusters for Leiden clustering (OT) | Detected number of clusters for Leiden clustering (PCA + Euclidean) |
| 500 cells | Simulated scRNA-seq data | Splatter | 500 | 5000 | 3 | 3 | 3 | 3 | 11 |
| 1000 cells | Simulated scRNA-seq data | Splatter | 1000 | 5000 | 3 | 3 | 3 | 3 | 12 |
| 10 000 cells | Simulated scRNA-seq data | Splatter | 10 000 | 5000 | 3 | 3 | 5 | 9 | 12 |
| Unbalanced clusters | Simulated scRNA-seq data | Splatter | 1000 | 5000 | 3 | 3 | 3 | 5 | 8 |
| Overlapping clusters | Simulated scRNA-seq data | Splatter | 1000 | 5000 | 3 | 3 | 3 | 3 | 12 |
| Liu scRNA | scRNA-seq | Liu | 206 | 10 000 | 3 | 3 | 3 | 3 | 3 |
| Li Tumor | scRNA-seq | Li | 364 | 10 000 | 7 | 3 | 6 | 2 | 2 |
| Li NM | scRNA-seq | Li | 266 | 10 000 | 7 | 9 | 23 | 6 | 4 |
| Li cell lines | scRNA-seq | Li | 561 | 10 000 | 7 | 8 | 10 | 9 | 7 |
| Liu scATAC | scATAC-seq | Liu | 206 | 10 000 | 3 | 3 | 3 | 3 | 7 |
| Leukemia scATAC | scATAC-seq | Corces | 391 | 7602 | 6 | 3 | 25 | 4 | 3 |
| scMethylation mouse | Single-cell DNA methylation | Luo | 3377 | 10 000 | 16 | 3 | 3 | 14 | 18 |
| scMethylation human | Single-cell DNA methylation | Luo | 2740 | 10 000 | 21 | 3 | 7 | 16 | 32 |
Note: In the first part of the table (‘Data description’), for each dataset, we specify the name with which we denote it in the paper, the reference to its original publication, the type of data, the number of cells, the number of features after preprocessing and the ground-truth number of clusters (e.g. cell types, cell lines) present in the data. In the second part of the table (‘Clustering results’), we report the number of clusters obtained by maximizing the silhouette score, with hierarchical clustering (for Pearson correlation and OT distance) and for a typical single-cell clustering workflow based on Leiden clustering (for the Euclidean distance on PCA components and for the OT distance).
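The selection rule described in the note, choosing the number of clusters that maximizes the silhouette score under hierarchical clustering, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pick_n_clusters`, the average linkage, the candidate range of 2 to 10 clusters and the synthetic three-group data are all assumptions, and a Euclidean distance matrix stands in for the OT or Pearson-based distances compared in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def pick_n_clusters(dist, k_range=range(2, 11), method="average"):
    """Return (best_k, labels, scores) for the k maximizing the silhouette score.

    dist is a square, symmetric cell-cell distance matrix with a zero diagonal.
    The linkage method and the candidate range are illustrative assumptions.
    """
    # linkage() expects the condensed (upper-triangular) form of the matrix.
    tree = linkage(squareform(dist), method=method)
    scores, labelings = {}, {}
    for k in k_range:
        # Cut the dendrogram into at most k flat clusters.
        labels = fcluster(tree, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue  # the silhouette score needs at least two clusters
        scores[k] = silhouette_score(dist, labels, metric="precomputed")
        labelings[k] = labels
    best_k = max(scores, key=scores.get)
    return best_k, labelings[best_k], scores


# Synthetic stand-in data: three well-separated groups of 30 "cells" each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 1.0, size=(30, 5)) for c in (0.0, 10.0, -10.0)])
dist = squareform(pdist(points, metric="euclidean"))

best_k, labels, scores = pick_n_clusters(dist)
print(best_k, round(scores[best_k], 3))
```

Because the function only consumes a precomputed distance matrix, any cell-cell distance can be passed in without changing the selection step.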
Fig. 1. Workflow for metrics comparison. The procedure employed, from the preprocessed input data to the performance evaluation, is summarized for (A) baseline metrics and (B) OT. The graphic contents in the figure are taken from flaticon.com.
Fig. 2. Comparison of OT against Pearson correlation in cell–cell similarity inference. Barplots for C-index and silhouette score are reported for (A) simulated scRNA-seq data composed of 500, 1000 and 10 000 cells, with unbalanced groups and overlapping clusters; (B) four scRNA-seq datasets; (D) two single-cell DNA methylation and two scATAC-seq datasets. Examples of the distance matrices obtained with OT, Pearson correlation and Euclidean distance in Liu scRNA-seq are reported in (C).
Fig. 3. Comparison of OT against Pearson correlation in hierarchical clustering. Barplots for ARI and NMI are reported for (A) simulated scRNA-seq data composed of 500, 1000 and 10 000 cells, with unbalanced groups and overlapping clusters; (B) four scRNA-seq datasets; (C) two single-cell DNA methylation and two scATAC-seq datasets.
Fig. 4. Comparison of a typical single-cell clustering workflow against its counterpart based on OT. Barplots for ARI and NMI are reported for (A) simulated scRNA-seq data composed of 500, 1000 and 10 000 cells, with unbalanced groups and overlapping clusters; (B) four scRNA-seq datasets; (C) two single-cell DNA methylation and two scATAC-seq datasets.
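The ARI (adjusted Rand index) and NMI (normalized mutual information) reported in Figs 3 and 4 both compare a predicted clustering with ground-truth labels. The sketch below spells out the standard definitions using only the Python standard library. The function names are illustrative, and normalizing the mutual information by the arithmetic mean of the two entropies is an assumption, since this excerpt does not state which NMI variant was used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(truth)
    if n < 2:
        return 1.0
    # Count pairs of cells grouped together in both, in truth, and in pred.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = together_truth * together_pred / comb(n, 2)
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)


def normalized_mutual_info(truth, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(truth)
    rows, cols = Counter(truth), Counter(pred)
    mutual_info = sum(
        c / n * log(c * n / (rows[u] * cols[v]))
        for (u, v), c in Counter(zip(truth, pred)).items()
    )
    entropy_truth = -sum(c / n * log(c / n) for c in rows.values())
    entropy_pred = -sum(c / n * log(c / n) for c in cols.values())
    mean_entropy = (entropy_truth + entropy_pred) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical labels: the prediction splits one true cluster in two.
truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(adjusted_rand_index(truth, pred), normalized_mutual_info(truth, pred))
```

Both scores are invariant to how cluster labels are named, so a perfect clustering scores 1 even when its label identifiers differ from the ground truth. In practice, scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` provide the same quantities.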