Maziyar Baran Pouyan, Dennis Kostka.
Abstract
Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore, obtaining accurate cell-cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal.
Year: 2018 PMID: 29950006 PMCID: PMC6022547 DOI: 10.1093/bioinformatics/bty260
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Nearest neighbor error values for dimension reduction (in percent, lower is better)
| Method | Patel | Buettner | Engel | Kolod | Goolam | Usoskin | Treutlein | Leng | Pollen | Lin | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFSIL1-tSNE | 3.8 | 0.5 | 4.0 | 1.0 | 7.5 | 2.7 | |||||
| RAFSIL1-PCA | 8.1 | 4.4 | 11.3 | 9.7 | 21.5 | 12.5 | 26.5 | 12.6 | 24.9 | 13.2 | |
| RAFSIL1-pPCA | 7.7 | 4.4 | 11.3 | 9.7 | 22.5 | 15.0 | 25.4 | 12.3 | 24.4 | 13.3 | |
| RAFSIL2-tSNE | 2.7 | 4.8 | 6.2 | 4.6 | 4.0 | 9.2 | 3.4 | ||||
| RAFSIL2-PCA | 10.2 | 6.6 | 5.9 | 4.8 | 5.6 | 12.5 | 25.9 | 16.3 | 33.1 | 12.1 | |
| RAFSIL2-pPCA | 9.8 | 7.1 | 4.9 | 4.0 | 5.3 | 11.2 | 26.3 | 14.3 | 30.6 | 11.4 | |
| SIMLR-tSNE | 3.7 | 3.3 | 4.4 | 4.8 | 5.5 | (26.2) | 19.8 | 3.0 | 15.7 | 8.6 | |
| SIMLR-PCA | 6.7 | 27.1 | 0.1 | 11.3 | 6.4 | (43.8) | 36.3 | 22.9 | 51.0 | 20.8 | |
| SIMLR-pPCA | 7.4 | 27.6 | 0.1 | 9.7 | 5.9 | (45) | 37.0 | 22.3 | 53.2 | 21.0 | |
| Data-HiE-tSNE | 7.4 | 12.1 | 14.3 | 0.3 | 1.6 | 3.7 | 15.0 | 37.0 | 3.3 | 10.7 | 10.5 |
| Data-HiE-PCA | 40.7 | 25.8 | 13.3 | 1.4 | 4.8 | 34.1 | 31.2 | 56.1 | 16.3 | 40.5 | 26.4 |
| Data-HiE-pPCA | 40.5 | 28.6 | 14.3 | 1.4 | 7.3 | 33.3 | 32.5 | 57.4 | 17.3 | 41.5 | 27.4 |
| Euclidean-HiE-tSNE | 4.4 | 4.4 | 3.9 | 0.4 | 6.5 | 8.2 | 23.8 | 39.1 | 5.3 | 21.1 | 11.7 |
| Euclidean-HiE-PCA | 36.5 | 7.7 | 35.0 | 7.0 | 25.8 | 58.7 | 32.5 | 52.8 | 19.9 | 39.1 | 31.5 |
| Euclidean-HiE-pPCA | 36.0 | 8.8 | 39.4 | 6.8 | 28.2 | 57.7 | 32.5 | 53.0 | 20.3 | 38.8 | 32.2 |
| Pearson-HiE-tSNE | 2.8 | 9.3 | 3.0 | 1.6 | 2.1 | 17.5 | 24.1 | 20.1 | 8.3 | ||
| Pearson-HiE-PCA | 25.1 | 23.1 | 16.3 | 0.1 | 2.4 | 27.8 | 12.5 | 49.1 | 10.6 | 27.9 | 19.5 |
| Pearson-HiE-pPCA | 24.0 | 23.1 | 17.2 | 0.3 | 2.4 | 27.5 | 15.0 | 47.6 | 11.3 | 28.1 | 19.7 |
| Spearman-HiE-tSNE | 3.3 | 11.0 | 1.0 | 3.2 | 15.4 | 3.0 | 18.4 | 6.1 | |||
| Spearman-HiE-PCA | 37.2 | 26.9 | 9.4 | 0.3 | 33.4 | 61.7 | 13.0 | 32.3 | 22.0 | ||
| Spearman-HiE-pPCA | 36.3 | 27.5 | 12.8 | 0.3 | 3.2 | 32.5 | 6.2 | 59.1 | 12.6 | 30.6 | 22.1 |
tSNE, t-distributed stochastic neighbor embedding; PCA, principal component analysis; pPCA, probabilistic PCA.
The best-performing method in each column is in boldface.
Parentheses indicate that SIMLR was run with different parameters for this dataset.
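The nearest neighbor error reported in these tables can be illustrated with a minimal sketch. It assumes the metric is the percentage of cells whose single nearest neighbor (excluding the cell itself) carries a different population label, using Euclidean distance in the reduced space; the text here does not spell out the exact definition, and the function name `nearest_neighbor_error` is hypothetical.

```python
import math


def nearest_neighbor_error(points, labels):
    """Percent of points whose nearest other point has a different label.

    Assumption: Euclidean distance and a leave-one-out 1-nearest-neighbor
    rule; the original study may define the metric differently.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    if len(points) < 2:
        raise ValueError("need at least two points")

    mismatches = 0
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is never its own neighbor
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            mismatches += 1
    return 100.0 * mismatches / len(points)


# Toy 2-D embedding with two well-separated groups.
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clean = nearest_neighbor_error(embedding, ["a", "a", "b", "b"])
noisy = nearest_neighbor_error(embedding, ["a", "b", "b", "b"])
```

The quadratic loop is acceptable for datasets of the size listed here (at most a few hundred cells); a spatial index would be preferable for larger inputs.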
List of datasets analyzed and their attributes
| Dataset | Number of cells | Number of genes | Number of populations | Sparsity (in %) | Units | References |
|---|---|---|---|---|---|---|
| Patel | 430 | 5948 | 5 | 0 | TPM | |
| Buettner | 182 | 9573 | 3 | 37 | FPKM | |
| Engel | 203 | 21 690 | 4 | 80 | TPM | |
| Kolod | 704 | 13 473 | 3 | 10 | CPM | |
| Goolam | 124 | 41 480 | 5 | 69 | CPM | |
| Usoskin | 622 | 17 772 | 4 | 78 | RPM | |
| Treutlein | 80 | 23 271 | 5 | 90 | FPKM | |
| Leng | 460 | 19 084 | 4 | 47 | TPM | |
| Pollen | 301 | 9966 | 11 | 67 | TPM | |
| Lin | 402 | 9437 | 16 | 43 | TPM | |
Fig. 1. RAFSIL2 discovers unwanted variation. This figure shows tSNE plots for two datasets: data from Usoskin in the first row, and from Kolodziejczyk in the second row. Cells are colored according to biologically meaningful annotations in panels one and three, and according to technical covariates in panels two and four. In both datasets, the biological annotations are different cell types. Technical covariates are different picking sessions (first row) and different sequencing chips (second row). In the first row, sub-structure within biologically meaningful groupings can be explained by technical variables for both methods (RAFSIL2 and SIMLR). In the second row, this still holds true for RAFSIL2, but SIMLR does not highlight the unwanted technical variation present in the data (for more details see Section 3.2.2).
Nearest neighbor error values for similarity learning (in percent, lower is better)
| Method | Patel | Buettner | Engel | Kolod | Goolam | Usoskin | Treutlein | Leng | Pollen | Lin | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFSIL1 | 1.6 | 3.8 | 1.0 | 2.4 | 2.6 | 10.0 | 5.0 | 3.7 | 3.5 | ||
| RAFSIL2 | 3.8 | 3.2 | 4.3 | 5.2 | |||||||
| SIMLR | 2.4 | 3.4 | 4 | 3.1 | (25) | 14.8 | 3 | 6.2 | 6.0 | ||
| Pearson-ALL | 1.9 | 57.7 | 38.9 | 9.7 | 3.2 | 10.5 | 20.0 | 49.6 | 12.3 | 14.4 | 21.8 |
| Pearson-FRQ | 2.1 | 58.2 | 42.4 | 10.4 | 2.4 | 7.2 | 12.5 | 42.8 | 10.3 | 14.7 | 20.3 |
| Pearson-HiE | 3.5 | 33.5 | 15.3 | 9.8 | 1.6 | 4.7 | 11.2 | 48.5 | 6.3 | 10.4 | 14.5 |
| Spearman-ALL | 2.8 | 57.7 | 12.8 | 0.9 | 15.1 | 28.8 | 58.7 | 2.0 | 13.7 | 19.3 | |
| Spearman-FRQ | 1.9 | 57.7 | 10.3 | 0.9 | 10.1 | 8.8 | 44.6 | 13.2 | 15.0 | ||
| Spearman-HiE | 14.4 | 43.4 | 9.9 | 1.8 | 2.4 | 7.4 | 10.0 | 29.1 | 5.3 | 8.5 | 13.2 |
| Euclidean-ALL | 30.0 | 51.6 | 48.3 | 24.7 | 2.4 | 14.5 | 21.2 | 44.6 | 6.0 | 22.4 | 26.6 |
| Euclidean-FRQ | 2.1 | 57.7 | 39.9 | 10.5 | 2.4 | 7.4 | 12.5 | 45.9 | 9.3 | 13.7 | 20.1 |
| Euclidean-HiE | 4.0 | 33.5 | 13.8 | 8.8 | 1.6 | 3.7 | 12.5 | 47.4 | 7.0 | 10.7 | 14.3 |
ALL, all expressed genes; FRQ, frequency-filtered genes; HiE, highly-expressed genes.
The best-performing method in each column is in boldface.
Parentheses indicate that SIMLR was run with different parameters for this dataset.
ARI and NMI values for clustering methods across ten datasets (in percent, higher is better)
| Method | Patel ARI | Patel NMI | Buettner ARI | Buettner NMI | Engel ARI | Engel NMI | Kolod ARI | Kolod NMI | Goolam ARI | Goolam NMI | Usoskin ARI | Usoskin NMI | Treutlein ARI | Treutlein NMI | Leng ARI | Leng NMI | Pollen ARI | Pollen NMI | Lin ARI | Lin NMI | Average ARI | Average NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFSIL1-KM | 89.6 | 88.4 | 27.7 | 47.0 | 54.4 | 73.5 | 76.9 | 77.8 | 34.8 | 59.2 | 84.4 | 92.0 | 51.9 | 73.6 | 66.3 | 76.5 | ||||||
| RAFSIL1-HC | 95.8 | 94.3 | 90.4 | 87.1 | 34.6 | 46.3 | 75.0 | 73.0 | 54.3 | 68.9 | 43.4 | 58.4 | 85.1 | 93.6 | 53.1 | 76.7 | 72.3 | 78.9 | ||||
| RAFSIL2-KM | 88.5 | 87.5 | 81.6 | 76.6 | 75.8 | 76.8 | 54.4 | 73.5 | 64.7 | 75.4 | 55.3 | 72.4 | 39.1 | 50.1 | 82.6 | 91.8 | 49.2 | 72.5 | 69.1 | 77.7 | ||
| RAFSIL2-HC | 97.0 | 95.5 | 84.3 | 80.6 | 36.7 | 53.0 | 91.6 | 54.7 | 81.2 | |||||||||||||
| SIMLR | 80.9 | 84.9 | 88.8 | 88.8 | 10.6 | 25.7 | 47.1 | 65.5 | 66.0 | 72.8 | (23.8) | (45.6) | 24.0 | 34.4 | 84.4 | 92.2 | 42.2 | 74.2 | 56.8 | 68.4 | ||
| SC3 | 88.7 | 86.1 | 46.0 | 64.2 | 54.4 | 73.5 | 84.5 | 81.6 | 54.3 | 63.1 | 32.8 | 55.5 | 95.3 | 71.4 | 80.0 | |||||||
| pcaReduce | 47.8 | 60.3 | 39.8 | 45.9 | 17.4 | 18.2 | 96.1 | 94.2 | 45.9 | 62.2 | 54.7 | 60.4 | 37.6 | 38.6 | 21.7 | 25.5 | 89.1 | 93.1 | 51.3 | 74.4 | 50.1 | 57.3 |
| SINCERA | 91.3 | 89.8 | 50.7 | 47.6 | 23.0 | 31.1 | 99.6 | 99.2 | 39.3 | 58.0 | 52.4 | 61.7 | 27.8 | 50.5 | 8.7 | 12.3 | 85.5 | 93.4 | 45.5 | 69.4 | 52.4 | 61.3 |
| Spearman-HiE-KM | 35.0 | 46.2 | 25.4 | 33.3 | 67.7 | 63.6 | 45.7 | 51.2 | 64.7 | 80.3 | 28.4 | 35.4 | 62.2 | 74.7 | 5.6 | 10.0 | 80.4 | 89.2 | 46.4 | 71.7 | 46.1 | 55.6 |
| Spearman-HiE-HC | 20.2 | 44.8 | 0.1 | 2.1 | 47.0 | 53.0 | 0.1 | 0.6 | 59.1 | 76.1 | 0.3 | 1.3 | 64.1 | 71.2 | 0.3 | 2.7 | 9.5 | 38.3 | 25.8 | 68.8 | 22.7 | 35.9 |
| Data-HiE-KM | 78.1 | 75.6 | 38.5 | 42.2 | 15.1 | 17.9 | 63.1 | 75.3 | 42.3 | 48.0 | 28.9 | 37.0 | 18.9 | 33.4 | 3.4 | 13.9 | 71.2 | 84.9 | 51.8 | 76.5 | 41.1 | 50.5 |
| Data-HiE-HC | 20.4 | 36.9 | 4.5 | 17.1 | 10.4 | 11.8 | 0.2 | 0.8 | 33.5 | 41.3 | 5.0 | 9.4 | 32.8 | 37.7 | −0.6 | 0.8 | 7.9 | 35.9 | 8.9 | 42.4 | 12.3 | 23.4 |
KM, k-means; HC, hierarchical clustering.
The best-performing method in each column is in boldface.
Parentheses indicate that SIMLR was run with different parameters for this dataset.
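For readers unfamiliar with the two clustering scores, the following is a minimal sketch of the adjusted Rand index (ARI) and normalized mutual information (NMI) computed from two label vectors. The ARI follows the standard pair-counting formula. For NMI, normalizing by the arithmetic mean of the two entropies is an assumption, since several normalizations are in common use and the text here does not say which one was applied. Both functions return values in fractional units; multiplying by 100 gives percentages like those in the table.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Standard pair-counting ARI between two labelings of the same items."""
    n = len(truth)
    if n != len(pred) or n < 2:
        raise ValueError("need two labelings of equal length >= 2")
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(truth, pred):
    """NMI, normalized by the mean entropy (an assumed convention)."""
    n = len(truth)
    if n != len(pred) or n == 0:
        raise ValueError("need two non-empty labelings of equal length")
    rows, cols = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (rows[a] * cols[b]))
        for (a, b), c in Counter(zip(truth, pred)).items()
    )
    denom = (_entropy(truth) + _entropy(pred)) / 2
    return 1.0 if denom == 0 else mi / denom


# Identical partitions up to renaming of clusters score 1.0 on both metrics.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
```

Both scores are invariant to how cluster labels are named, which is why they suit comparisons between a clustering and reference annotations. ARI can fall below zero when agreement is worse than chance.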
ARI and NMI values for clustering methods across ten datasets after dimension reduction (in percent, higher is better)
| Method | Patel ARI | Patel NMI | Buettner ARI | Buettner NMI | Engel ARI | Engel NMI | Kolod ARI | Kolod NMI | Goolam ARI | Goolam NMI | Usoskin ARI | Usoskin NMI | Treutlein ARI | Treutlein NMI | Leng ARI | Leng NMI | Pollen ARI | Pollen NMI | Lin ARI | Lin NMI | Average ARI | Average NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFSIL1-tSNE-KM | 96.8 | 95.7 | 54.4 | 73.5 | 61.7 | 71.5 | 33.8 | 58.2 | 46.4 | 61.1 | 89.0 | 48.6 | 74.1 | 66.9 | 77.5 | |||||||
| RAFSIL1-tSNE-HC | 93.4 | 91.8 | 26.6 | 46.3 | 54.4 | 73.5 | 64.6 | 75.8 | 54.8 | 70.9 | 46.6 | 62.4 | 89.2 | 50.1 | 76.3 | |||||||
| RAFSIL2-tSNE-KM | 87.3 | 83.1 | 26.6 | 46.3 | 34.9 | 41.7 | 54.4 | 73.5 | 65.5 | 77.1 | 55.0 | 72.4 | 60.0 | 88.0 | 93.3 | 42.1 | 71.2 | 60.0 | 71.5 | |||
| RAFSIL2-tSNE-HC | 87.5 | 85.0 | 24.8 | 45.1 | 30.9 | 38.9 | 54.4 | 73.5 | 65.9 | 78.5 | 30.9 | 46.2 | 87.5 | 93.3 | 48.8 | 73.6 | 58.4 | 70.3 | ||||
| SIMLR-tSNE-KM | 90.8 | 89.6 | 88.8 | 88.8 | 10.6 | 25.7 | 47.1 | 65.5 | 66.0 | 73.4 | (27.3) | (30.0) | 47.1 | 82.4 | 90.5 | 41.3 | 71.8 | 60.1 | 70.1 | |||
| SIMLR-tSNE-HC | 80.9 | 84.9 | 88.8 | 88.8 | 10.6 | 25.7 | 47.1 | 65.5 | 66.0 | 73.4 | (40.7) | (41.7) | 47.7 | 72.5 | 88.4 | 42.1 | 74.2 | 59.6 | 70.7 | |||
| Data-tSNE-KM | 71.5 | 72.2 | 33.4 | 33.0 | 18.0 | 18.9 | 92.6 | 90.0 | 35.8 | 52.6 | 31.6 | 55.1 | 16.7 | 26.4 | 82.3 | 88.8 | 54.4 | 52.1 | 59.6 | |||
| Data-tSNE-HC | 66.4 | 67.3 | 25.6 | 29.7 | 24.1 | 29.3 | 59.2 | 63.4 | 45.9 | 62.7 | 80.4 | 74.5 | 40.5 | 52.7 | 1.8 | 9.5 | 93.4 | 77.5 | 49.3 | 56.0 | ||
| Pearson-tSNE-KM | 88.5 | 86.2 | 29.2 | 33.6 | 28.9 | 35.4 | 64.9 | 66.6 | 40.2 | 62.0 | 6.9 | 10.9 | 78.6 | 91.1 | 48.1 | 73.0 | 54.4 | 63.3 | ||||
| Pearson-tSNE-HC | 87.5 | 85.3 | 27.3 | 35.6 | 33.5 | 51.1 | 48.5 | 71.2 | 63.6 | 66.0 | 53.3 | 65.2 | 14.7 | 17.2 | 84.3 | 92.8 | 42.9 | 72.7 | 55.5 | 65.7 | ||
The best-performing method in each column is in boldface.
Parentheses indicate that SIMLR was run with different parameters for this dataset.
Fig. 2. RAFSIL2 yields accurate and robust clustering solutions. Panels are box plots of the ARI for ten datasets, across 20 instances of randomly sampling 90% of available cells. The panel labeled ‘Average’ represents the mean performance across all ten datasets. We see that RAFSIL2 followed by hierarchical clustering has the best performance, followed by SC3 and then the other RAFSIL-type methods. In terms of robustness, SC3 performs best, while pcaReduce shows the highest variability (see Section 3.2.3 for a more detailed discussion). KM, k-means; HC, hierarchical clustering; HiE, highly expressed genes.
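The resampling scheme described in the caption (20 random draws of 90% of the cells) can be sketched as below. This is only an illustration of the subsampling loop: the helper name `subsample_scores` and the `score_fn` callback are hypothetical, and sampling without replacement with a fixed seed is an assumption. In practice `score_fn` would cluster the selected cells and return a score such as the ARI against the known annotations.

```python
import random


def subsample_scores(n_cells, score_fn, n_repeats=20, fraction=0.9, seed=0):
    """Evaluate score_fn on repeated random subsets of cell indices.

    score_fn is a hypothetical callback that receives a sorted list of
    selected cell indices and returns a number (for example an ARI).
    Sampling is without replacement within each repeat.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the draws reproducible
    size = round(fraction * n_cells)
    scores = []
    for _ in range(n_repeats):
        indices = sorted(rng.sample(range(n_cells), size))
        scores.append(score_fn(indices))
    return scores


# Placeholder scorer that just reports the subset size.
sizes = subsample_scores(100, score_fn=len)
```

The spread of the returned scores across repeats is what a box plot per method and dataset would summarize.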