Peng Liu1, Silvia Liu2, Yusi Fang1, Xiangning Xue1, Jian Zou1, George Tseng1, Liza Konnikova3,4,5.
Abstract
Progress in high-dimensional cytometry has greatly increased the number of markers that can be analyzed simultaneously, producing datasets with large numbers of parameters. Traditional biaxial manual gating might not be optimal for such datasets. To overcome this, a large number of automated tools have been developed to aid with cellular clustering of multi-dimensional datasets. Here we review two large categories of such tools: unsupervised and supervised clustering tools. After a thorough review of the popularity and use of each of the available unsupervised clustering tools, we focus on the top six tools to discuss their advantages and limitations. Furthermore, we employ a publicly available dataset to directly compare the usability, speed, and relative effectiveness of the available unsupervised and supervised tools. Finally, we discuss the current challenges for existing methods and future directions for the new generation of cell type identification approaches.
Keywords: CyTOF; auto-gating; cell type identification; clustering; manual gating; visualization
Year: 2020 PMID: 32411698 PMCID: PMC7198724 DOI: 10.3389/fcell.2020.00234
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
Comparison of manual gating, unsupervised and supervised clustering methods.
| | Manual gating | Unsupervised clustering | Supervised clustering |
| Ease of use | Easy and straightforward for biologists | Tool dependent, generally easy to apply. See | Tool dependent, generally requires more steps than unsupervised clustering methods |
| Reproducibility | Reproducible between data for same user | Majority of the tools allow for setting a “seed”, enabling the reproducibility of the results. See | Variable (tool dependent) |
| Time cost | Experience and sample size dependent | Tool dependent, see | Tool dependent, generally high. See |
| Flexibility | High, depends on the user's manual settings | Moderate, users can only adjust some parameters | Low |
| Novel subpopulation detection | Yes | Yes (tool dependent) | No (can only detect previously defined clusters) |
| Subpopulation/cluster identification | Manual (based on gating strategy) | Manual (based on cluster marker expression) | Automated (based on training set) |
| # of subpopulations/clusters | Experiment dependent | Variable (some allow user input; some automatically optimize #) See | Fixed (based on training set) |
| Prior knowledge requirement | Gating Experience, Marker expression for cellular identification | None for clustering; knowledge of marker expression for cluster identification | Training dataset or marker matrix, familiarity with bioinformatics |
FIGURE 1. Overview of manual gating, unsupervised and supervised clustering tools for high-dimensional cytometry data analysis.
Overview of reviews of clustering tools.
| Algorithmic Tools for Mining High-Dimensional Cytometry Data | High-dimensional cytometry data | PCA, viSNE | Wanderlust | * Dimensionality-reduction techniques | Description of the applications | No | ||
| * Clustering-based analysis | ||||||||
| * Trajectory detection algorithm | ||||||||
| The end of gating? An introduction to automated analysis of high dimensional cytometry data | High-dimensional cytometry data | PCA | Wanderlust | * Algorithms for analysis of high-dimensional data | * Describe the applications in other publications * 14-parameter flow cytometry dataset as a practical example and be associated with | No | ||
| Gate to the future: Computational analysis of immunophenotyping data | High-dimensional cytometry data | t-SNE | None | * Manual gating | Description of the applications | No | ||
| * Algorithm-assisted gating | ||||||||
| * Algorithm-based clustering | ||||||||
| A Beginner’s Guide to Analyzing and Visualizing Mass Cytometry Data | Mass cytometry data | t-SNE, SPADE, others to match tools discussed | None | * Automated data analysis | Provides a detailed user guide using two murine datasets | No | ||
| Comparison of Clustering Methods for High-Dimensional Single-Cell Flow and Mass Cytometry Data | High-dimensional cytometry data | None, only performance comparison | None | * Clustering methods | Evaluates tool performance with 6 datasets (4 CyTOF and 2 flow cytometry) | Yes (F1 score, running time, expression profiles, stability of the clustering results) | ||
| Computational flow cytometry: helping to make sense of high-dimensional immunology data | Flow cytometry data | SPADE, FlowMap, FlowSOM, viSNE, PhenoGraph, Scaffold map, DREMI-DREVI | None | * Methods based on dimensionality reduction techniques | Apply visualization techniques using a manual gated dataset and marker visualization application | No | ||
| * Clustering based techniques | ||||||||
| * Automated population identification | ||||||||
| * Biomarker identification | ||||||||
| * Cell development modeling | ||||||||
| Computational approaches for high-throughput single-cell data analysis | Single-cell RNA-seq | PCA, MDS, tSNE, Diffusion maps, SPRING, SPADE, FLOWSOM, Scaffold Maps, FLOWMAP, Phenograph | None | * Visualizing high-dimensional single-cell data | Visualization application using a publicly available scRNA-Seq PBMC dataset | No | ||
| * Dimensionality reduction clustering | ||||||||
| * Cell type identification | ||||||||
| * Cell type identification | ||||||||
| * clustering-based approach | ||||||||
| * Approaches for modeling gradual transitions | ||||||||
| * Differential analysis | ||||||||
| * Cytometry-based approaches | ||||||||
| * Sequencing-based approaches | ||||||||
| Meeting the Challenges of High-Dimensional Single-Cell Data Analysis in Immunology | Single-cell RNA-seq | tSNE, PCA, UMAP | Diffusion pseudotime (DPT); Partition-based graph abstraction (PAGA) | * Linear dimensionality reduction | Visualization and clustering application of a publicly available scRNA-Seq PBMC dataset | No | ||
| * Non-linear dimensionality reduction | ||||||||
| * Clustering methods; single-cell resolution is lost | ||||||||
| * Trajectory inference and graph abstraction | ||||||||
| CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets | Mass cytometry data | UMAP, tSNE, MDS | None | * Differential analysis * Cell population identification | Detailed data analysis workflow: data pre-processing, clustering, differential analysis and visualization of a publicly available CyTOF PBMC dataset | No |
Unsupervised clustering tools.
| 1. | ACCENSE | 1. t-SNE dimensionality reduction; 2. k-means or density-based clustering | GUI application | n/a | Yes | No | No | 2.48* | 0.28* | 0.60* |
| 2. | CCAST | 1. identify cell population; 2. refine cluster assignment; 3. estimate a gating scheme by decision tree; 4. optimize the decision tree | R package “CCAST” | Decision tree | Yes | Yes | Yes | 77.32 | 0.71 | 0.72 |
| 3. | ClusterX | 1. t-SNE dimensionality reduction; 2. local density estimation; 3. peak detection; 4. cluster assignment | R package “cytofkit” | n/a | Yes | No | Yes | 105.14 | 0.25 | 0.22 |
| 4. | Cytometree | Implements a binary tree algorithm for clustering | R package “cytometree” | Binary tree | Yes | No | No | 12.30 | 0.08 | 0.20 |
| 5. | densityCUT | 1. density estimation; 2. density refinement; 3. local-maxima based clustering; 4. hierarchical stable clustering | R package “densitycut” | n/a | Yes | No | Yes | 3.94 | 0.78 | 0.34 |
| 6. | DensVM | 1. t-SNE dimension reduction; 2. density-based peak calling and clustering; 3. SVM classification for less-confident cells | R package “cytofkit” | n/a | Yes | No | No | 43.83* | 0.71* | 0.69* |
| 7. | DEPECHE | k-means clustering | R package “depecheR” | n/a | Yes | Yes | No | 3.46 | 0.75 | 0.53 |
| 8. | FLOCK | 1. hypergrid creation; 2. identifying dense hyperregions; 3. merging neighboring dense hyperregions; 4. clustering | Available at ImmPort online | n/a | Yes (Need to register at Galaxy) | No (can adjust # of bins and density) | Yes | 0.30 | 0.73 | 0.65 |
| 9. | flowClust | t-mixture models with the Box-Cox transformation | R package “flowClust” | n/a | Yes | Yes | Yes | 4.99 | 0.41 | 0.43 |
| 10. | FlowGrid | density-based clustering algorithm DBSCAN with the scalability of grid-based clustering | Github (Python package “FlowGrid”) | n/a | Yes | No (can adjust # of bins and density) | Yes | 0.25^ | 0.54 | 0.48 |
| 11. | flowMeans | k-means clustering | R package “flowMeans” | n/a | Yes | Yes | Yes | 6.01 | 0.64 | 0.63 |
| 12. | flowPeaks | 1. k-means; 2. Gaussian finite mixture to model the density function; 3. peak search and merging; 4. cluster tightening | R package “flowPeaks” | n/a | Yes | Yes | Yes | 0.19 | 0.64 | 0.55 |
| 13. | FlowSOM | 1. self-organizing map building; 2. MST building; 3. meta-clustering | R package “FlowSOM” and “cytofkit” | MST, Chart plot | Yes | Yes | Yes (if a seed is set) | 0.19 | 0.62 | 0.67 |
| 14. | PAC-MAN | 1. partitioning by density-based methods; 2. post-processing | R package “PAC” | n/a | Yes | Yes | Yes | 0.35 | 0.78 | 0.74 |
| 15. | PhenoGraph | 1. Construct nearest-neighbor graph; 2. community partitioning | R package “cytofkit” | n/a | Yes | No (Can adjust # of nearest neighbours) | Yes | 5.89 | 0.71 | 0.78 |
| 16. | Rclusterpp | flexible native hierarchical clustering | R package “Rclusterpp” | Hierarchical-structure | Yes (Need to manually download source file) | No | Yes | 17.40 | 0.70 | 0.71 |
| 17. | SamSPECTRAL | Spectral-clustering with data reduction scheme | R package “SamSPECTRAL” | n/a | No (requires manual tuning for optimal results) | Yes | Yes | 24.70 | 0.57 | 0.33 |
| 18. | SPADE | 1. Density-dependent down-sampling; 2. MST construction | R package “spade” | MST | Yes | Yes (given cluster number k, it can create between [k/2, 3k/2] clusters) | No | 2.83 | 0.58 | 0.66 |
| 19. | SWIFT | 1. Fit GMM; 2. Refine GMM; 3. agglomerative merging | GUI application by Matlab | n/a | Yes | No (can adjust # of bins and density) | No | 20.02* | 0.06* | 0.29* |
| 20. | X-shift | 1. estimate cell event density; 2. arrange populations by marker-based classification | GUI application | Divisive Marker Trees | Yes | Yes | Yes | 35.10 | 0.65 | 0.67 |
| 21. | immunoClust | 1. iterative model-based clustering; 2. meta-clustering | R package “immunoClust” | n/a | Yes | No | Yes | 82.72 | 0.29 | 0.47 |
| 22. | k-means | k-means clustering | R base package “stats” | n/a | Yes | Yes | Yes | 11.68 | 0.63 | 0.63 |
| 23. | Citrus | cluster identification, characterization and regression | R package “Citrus” | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 24. | CellCnn | convolutional neural networks | Python 2.7 package on Github | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 25. | Cydar | 1. cell alignment in hyperspheres in high dimensional space; 2. differential abundance analysis | R package “cydar” | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 26. | diffcyt | 1. FlowSOM clustering; 2. empirical Bayes moderated tests for differential abundance analysis | R package “diffcyt” | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 27. | AUTO-SPADE | 1. Fuzzy-C-Mean clustering; 2. Merging clusters using Markov clustering; 3. Integration with SPADE | No tool available | |||||||
| 28. | CytoSPADE | SPADE clustering | No tool available | |||||||
| 29. | DBM | density based merging (DBM) algorithm | No tool available | |||||||
| 30. | FLAME | multivariate skew t mixture models | No full tool pipeline available | |||||||
| 31. | flowMerge | 1. clustering based on flowClust models; 2. merge clusters | For the downsampled data, cluster numbers ranging from 15 to 25 were applied, but the merged result was NA. | |||||||
| 32. | Flow-SNE | 1. t-SNE data embedding; 2. cluster number estimation; 3. k-means clustering; 4. merging of clusters | No tool available | |||||||
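Several of the tools listed above build on k-means, and many are reproducible only when a random seed is fixed. The following is a minimal sketch of that idea in plain Python, not the implementation used by any of the listed packages. The function name `kmeans`, the toy two-marker `cells` values and the seed are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed, n_iter=50):
    """Lloyd's algorithm; returns one cluster label per point.

    Hypothetical helper for illustration only. The seed fixes the choice
    of initial centroids, which is the only random step.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Toy "cells" with two marker intensities each (made-up values).
cells = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
run_a = kmeans(cells, k=2, seed=42)
run_b = kmeans(cells, k=2, seed=42)
```

With the same seed, `run_a` and `run_b` are identical. A different seed may start from other centroids and can therefore number the clusters differently or converge to a different partition.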
FIGURE 2. Visualization of dimensionality reduction tools. (A) Principal component analysis (PCA); (B) t-Distributed Stochastic Neighbor Embedding (t-SNE) and (C) Uniform Manifold Approximation and Projection (UMAP). All three dimensionality reduction approaches were applied to the same sample from the Fluidigm Maxpar Direct Immune Profiling Assay dataset available on Cytobank. The input data consisted of 184,968 CD45+ cells, and 21 markers were used (Supplementary Table S2). PCA, t-SNE and UMAP were performed in R using the prcomp, Rtsne, and umap functions, respectively. Manually gated subpopulations were uniformly colored across all three plots.
FIGURE 3. Number of citations and applications for unsupervised clustering tools in seven immunology journals since 2015. (A) Number of total citations based on Google Scholar; (B) Number of tool applications, only counting citations with real data application by the tools.
FIGURE 4. t-SNE visualizations for manual gating and five popular unsupervised clustering tools. (A) Manual gating, (B) ACCENSE, (C) DensVM, (D) SPADE, (E) FlowSOM and (F) PhenoGraph. Manual gating and these five tools were applied to the same data as illustrated in Figure 2. The original t-SNE (Supplementary Figure S4) was altered so that all clustering results were visualized with uniform color coding, where one color represents the same population across all the t-SNE plots. Clusters were manually annotated based on marker expression in each cluster by each clustering method (Supplementary Figure S3). Tools were applied to the full dataset with 180K cells, except for ACCENSE and DensVM, where the data was down-sampled to 20K cells before applying the tools because running them on the full dataset took more than 3 h. For SPADE and FlowSOM, we set the number of clusters to 20. For ACCENSE, PhenoGraph and DensVM, the number of clusters was automatically optimized by the tool.
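The Figure 4 caption mentions down-sampling to 20K cells for the slower tools but does not say how the subset was drawn. As a rough sketch, assuming uniform random sampling without replacement (an assumption, not the authors' stated procedure), a seeded down-sampling step could look like this in Python. The name `downsample` and the toy event list are hypothetical.

```python
import random


def downsample(events, n_keep, seed):
    """Return n_keep events chosen uniformly at random, in original order.

    Hypothetical helper: uniform sampling is an assumption made for this
    sketch. A fixed seed makes the chosen subset repeatable.
    """
    if n_keep >= len(events):
        return list(events)  # nothing to drop
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(events)), n_keep))
    return [events[i] for i in keep]


# Toy stand-in for a table of cells: each event is just its row index.
events = list(range(1000))
subset = downsample(events, n_keep=100, seed=1)
subset_again = downsample(events, n_keep=100, seed=1)
```

Keeping the sampled indices sorted preserves the acquisition order of the events, which makes it easier to map cluster labels back to the original file.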
FIGURE 5. Tree-structure visualization for SPADE and FlowSOM. (A) SPADE tree colored by CD3 intensity. (B) FlowSOM tree with multiple marker intensities. Both tools were applied to the example from Figure 2.
Supervised clustering tools.
| 1. | OpenCyto | A method mimicking manual gating by incorporating information from a gating template | R package “openCyto” available on Bioconductor | Gating template, can be a complete table or added inline one cell type at a time | Tutorials available; preparing the gating template is challenging | Fast (running time depends on the choice of algorithms in the gating template) | Not evaluated | Not evaluated | We did not evaluate ARI and F measure because OpenCyto is not fully automated; it needs the user's supervision and fine parameterization. |
| 2. | DeepCyTOF | Uses training data to predict cell types based on deep learning techniques | Python, Github | Training data | Time consuming to understand the example scripts and adapt them to your own data | 1.36 | 0.96 | 0.93 | 50% of the cells in the sample were randomly chosen as training sample |
| 3. | CyTOF Linear Classifier | Uses training data to predict the cell types based on linear discriminant analysis (LDA) | R, Matlab, Github | Training data | Easy to run | 0.12 | 0.91 | 0.92 | 50% of the cells in the sample were randomly chosen as training sample |
| 4. | ACDC | Uses marker matrix information to predict the cell types based on semi-supervised learning techniques | Python package (Bitbucket) | Marker matrix | Time consuming to understand the example scripts and adapt them to your own data | 24 | 0.81 | 0.77 | – |
| 5. | MP (Mondrian) | Uses a marker matrix to predict cell types through a Bayesian model | Python, Github | Marker matrix | Time consuming to understand the example scripts and adapt them to your own data | 109 | 0.55 | 0.49 | 50% of the cells in the sample were randomly chosen as training sample |
| 6. | flowLearn | Uses gates from training data to predict gating thresholds in other samples through density alignment | R package flowLearn, Github | Training data | Not fully automated; gates one marker per step. | Fast for predicting one threshold at each step | Not evaluated | Not evaluated | We did not evaluate ARI and F measure because flowLearn is not fully automated and needs the user's supervision. |
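The tables report agreement with manual gating using the ARI (adjusted Rand index). For readers unfamiliar with the metric, the sketch below computes the standard pair-counting ARI in plain Python. It illustrates the textbook formula and is not the evaluation code used in the review; the function name and the toy labels are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions (up to renaming of clusters);
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: trivially identical
    # Pairs of cells that share a label in both partitions.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a label within each partition separately.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (both - expected) / (max_index - expected)


# Toy example: manual gates versus cluster IDs (made-up labels).
manual = ["T", "T", "T", "B", "B", "NK"]
clusters = [1, 1, 1, 2, 2, 3]
ari_perfect = adjusted_rand_index(manual, clusters)
ari_partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Because the index depends only on which cells are grouped together, cluster IDs do not need to match the gate names. In the second call, splitting one gate into two clusters lowers the score below 1.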