Lijun Cheng1, Pratik Karkhanis1, Birkan Gokbag1, Yueze Liu2, Lang Li1.
Abstract
Single-cell mass cytometry, also known as cytometry by time of flight (CyTOF), is a powerful high-throughput technology that allows analysis of up to 50 protein markers per cell for the quantification and classification of single cells. Traditional manual gating used to identify new cell populations is inadequate, inefficient, unreliable, and difficult to use, and no algorithm that identifies both calibration and new cell populations has been well established. Deep learning with graphic cluster (DGCyTOF) visualization is developed as a new integrated embedding visualization approach for identifying canonical and new cell types. DGCyTOF combines deep-learning classification and hierarchical stable-clustering methods to sequentially build a tri-layer construct for classifying known cell types and identifying new cell types. First, a deep classification model distinguishes calibration cell populations from all cells by softmax classification assignment under a probability threshold; graph embedding clustering is then used to identify new cell populations. Between these two layers, cell labels are automatically adjusted between new and unknown cell populations via a feedback loop, using an iterative calibration system to reduce the error rate in cell-type identification. Finally, a 3-dimensional (3D) visualization platform displays the cell clusters with all cell-population types annotated.
Using two benchmark CyTOF databases comprising up to 43 million cells, we compared the accuracy and speed of cell-type identification among DGCyTOF, DeepCyTOF, and other approaches that combine dimension reduction with clustering: Principal Component Analysis (PCA), Factor Analysis (FA), Independent Component Analysis (ICA), Isometric Feature Mapping (Isomap), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), each paired with k-means clustering and Gaussian mixture clustering. We observed that DGCyTOF represents a robust, complete learning system with high accuracy, speed, and visualization quality by eight measurement criteria. DGCyTOF achieved F-scores of 0.9921 for the CyTOF1 dataset and 0.9992 for the CyTOF2 dataset, whereas the corresponding scores were only 0.507 and 0.529 for t-SNE+k-means and 0.565 and 0.59 for UMAP+k-means. Compared with t-SNE and UMAP visualization, DGCyTOF was approximately 35% more accurate in predicting cell types. In addition, the distribution of cell populations was more intuitive to observe in the 3D visualization of DGCyTOF than in t-SNE and UMAP visualization. The DGCyTOF model can automatically assign known labels to single cells with high accuracy by combining deep-learning classification with traditional graph-clustering and dimension-reduction strategies. Guided by a calibration system, the model seeks an optimal accuracy balance between calibration cell populations and unknown cell types, yielding a complete and robust learning system that identifies cell populations with high accuracy compared with other methods for analyzing single-cell CyTOF data. The DGCyTOF method for identifying cell populations could be extended to the analysis of single-cell RNA-Seq data and other omics data.
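The first layer described in the abstract, softmax assignment under a probability threshold, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the logits, the label names, and the 0.9 threshold are assumptions chosen for the example, and the abstract does not state the threshold value it uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_label(logits, labels, threshold=0.9):
    """Return (label, probability); cells below the threshold stay 'unknown'.

    The threshold of 0.9 is a hypothetical default, not a value from the source.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best], probs[best]
    return "unknown", probs[best]

# Hypothetical known (calibration) cell types and classifier outputs.
labels = ["B_cell", "T_cell", "Monocyte"]
confident = assign_label([8.0, 1.0, 0.5], labels)   # one class dominates
ambiguous = assign_label([1.2, 1.0, 0.9], labels)   # no class dominates
```

In this sketch, cells returned as "unknown" are the ones that would be passed on to the clustering layer for discovery of new populations.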
Year: 2022 PMID: 35404970 PMCID: PMC9060369 DOI: 10.1371/journal.pcbi.1008885
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
Fig 2. Cell population identification by DGCyTOF in the analysis of CyTOF1 and CyTOF2 datasets.
Fig 2A identifies the 32 types of known cells by deep classification learning for dataset CyTOF1, and Fig 2C, the 13 types of known cells for CyTOF2. Fig 2B and 2D show the spectral clustering for the identification and visualization of unknown cell populations in the two datasets.
Two CyTOF benchmark data sets for analysis.
| Database | No. of Cells | No. of markers | No. of manually gated populations | No. of manually gated cells (label data) |
|---|---|---|---|---|
| CyTOF1 | 167,004 | 13 | 24 | 81,747 |
| CyTOF2 | 265,627 | 32 | 14 | 104,184 |
Contingency table for calculating the receiver operating characteristic curve.
| Total population | Condition positive | Condition negative | Prevalence |
|---|---|---|---|
| Predicted condition positive | True positive (TP) | False positive (FP) | |
| Predicted condition negative | False negative (FN) | True negative (TN) | |
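
The four counts of such a contingency table (true positives, false positives, false negatives, true negatives) are enough to derive the quantities used for a receiver operating characteristic curve and for the F-score. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical, and it assumes every denominator is nonzero.

```python
def contingency_metrics(tp, fp, fn, tn):
    """Derive common rates from the four cells of a 2x2 contingency table."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)            # positive predictive value
    recall = tp / (tp + fn)               # true positive rate (ROC y-axis)
    fpr = fp / (fp + tn)                  # false positive rate (ROC x-axis)
    f_score = 2 * precision * recall / (precision + recall)
    prevalence = (tp + fn) / total        # share of condition-positive cases
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f_score": f_score,
        "prevalence": prevalence,
    }

# Hypothetical counts for one cell type treated as the positive class.
m = contingency_metrics(tp=80, fp=10, fn=20, tn=90)
```

One point on the ROC curve is the pair (`fpr`, `recall`); sweeping the decision threshold and recomputing the counts traces the full curve.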
Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets.
| Measurement | DGCyTOF (CyTOF1) | DGCyTOF (CyTOF2) | DeepCyTOF (CyTOF1) | DeepCyTOF (CyTOF2) |
|---|---|---|---|---|
| F-score | 0.9921 | 0.9992 | 0.9925 | 0.999 |
| | 0.9924 | 0.9991 | 0.992 | 0.9981 |
| | 0.9932 | 0.9993 | 0.993 | 0.9992 |
| | 0.9822 | 0.987 | 0.9931 | 0.986 |
Note: All arrow indicators showed good trends
Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types).
| Methods | Measurement | | | | | UMAP | t-SNE | |
|---|---|---|---|---|---|---|---|---|
| | Computation time (in | 0.017 | 3.618 | 0.060 | 0.020 | 18.806 | 94.466 | |
| | | 0.626 | 0.633 | 0.625 | 0.626 | 0.462 | 0.423 | |
| | | 40.6% | 33.4% | - | 40.61% | N/A | N/A | |
| | Visualization | DD | DD | DD | DD | DD | ED | |
| | | | | | | | | |
| k-means | F-score | 0.286 | 0.282 | 0.288 | 0.269 | 0.565 | 0.507 | 0.286 |
| | | 0.236 | 0.228 | 0.236 | 0.216 | 0.556 | 0.483 | 0.235 |
| | | 0.307 | 0.298 | 0.307 | 0.286 | 0.627 | 0.563 | 0.306 |
| | | 0.494 | 0.488 | 0.495 | 0.477 | 0.793 | 0.762 | 0.494 |
| | | | | | | | | |
| | | 0.305 | 0.304 | 0.505 | 0.285 | 0.588 | 0.502 | 0.530 |
| | | 0.247 | 0.258 | 0.436 | 0.229 | 0.538 | 0.493 | 0.494 |
| | | 0.317 | 0.326 | 0.544 | 0.298 | 0.608 | 0.573 | 0.556 |
| | | 0.497 | 0.490 | 0.586 | 0.481 | 0.790 | 0.768 | 0.704 |
| | | | | | | | | |
| | | 0.442 | 0.451 | 0.442 | 0.438 | | 0.771 | 0.596 |
| | | 0.336 | 0.349 | 0.337 | 0.332 | | 0.738 | 0.534 |
| | | 0.526 | 0.531 | 0.524 | 0.524 | | 0.789 | 0.621 |
| | | 0.557 | 0.570 | 0.552 | 0.555 | | 0.850 | 0.557 |
Note–All arrow indicators showed good trends. ARI, adjusted Rand index, measure of similarity between two clusters, involves random labeling independent of the number of clusters; DD, difficult to distinguish; ED, easy to distinguish; FMI, Fowlkes-Mallows score, geometric mean of pair-wise precision and recall; F-score, harmonic mean of precision and recall (values range from 0 [bad] to 1 [good]); NPE, neighborhood proportion error; V-measure, harmonic mean of homogeneity and completeness. All results reflect comparison of two dimensions, and the number of nearest neighbors (k) is 20.
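The note defines the Fowlkes-Mallows score as the geometric mean of pair-wise precision and recall. A minimal pair-counting sketch of that definition follows; the function names and the toy label vectors are hypothetical, and the quadratic loop over all pairs is only practical for small inputs, not for full CyTOF datasets.

```python
from itertools import combinations
from math import sqrt

def pair_counts(true_labels, pred_labels):
    """Count sample pairs grouped together in both, only predicted, or only true."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    return tp, fp, fn

def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pair-wise precision and recall."""
    tp, fp, fn = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)

# Toy example: six cells, two true populations, three predicted clusters.
fmi = fowlkes_mallows([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
# Cluster names do not matter, only which cells are grouped together.
fmi_perfect = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
```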
Comparison of machine-learning methods by different measurements for CyTOF Dataset 2 (32 biomarkers, 14 labeled cell types).
| Methods | Measurement | | | | | UMAP | t-SNE | |
|---|---|---|---|---|---|---|---|---|
| | Computation time (in | 0.0253 | 0.208 | 0.054 | 0.021 | 16.944 | 95.609 | |
| | | 0.536 | 0.525 | 0.535 | 0.536 | 0.399 | 0.393 | |
| | | 31.03% | 27.70% | - | 31.03% | N/A | N/A | |
| | Visualization | DD | DD | DD | DD | | ED | |
| | | | | | | | | |
| k-means | F-score | 0.426 | 0.421 | 0.431 | 0.322 | 0.590 | 0.529 | 0.458 |
| | | 0.343 | 0.330 | 0.349 | 0.232 | 0.540 | 0.475 | 0.409 |
| | | 0.444 | 0.431 | 0.448 | 0.340 | 0.637 | 0.578 | 0.537 |
| | | 0.609 | 0.582 | 0.611 | 0.444 | 0.799 | 0.744 | 0.735 |
| | | | | | | | | |
| | | 0.497 | 0.446 | 0.670 | 0.313 | 0.626 | 0.585 | 0.395 |
| | | 0.406 | 0.353 | 0.573 | 0.221 | 0.577 | 0.534 | 0.339 |
| | | 0.500 | 0.453 | 0.706 | 0.331 | 0.665 | 0.631 | 0.461 |
| | | 0.636 | 0.589 | 0.690 | 0.443 | 0.807 | 0.785 | 0.684 |
| | | | | | | | | |
| | | 0.669 | 0.650 | 0.665 | 0.560 | | 0.923 | 0.684 |
| | | 0.573 | 0.547 | 0.569 | 0.417 | | 0.907 | 0.601 |
| | | 0.696 | 0.680 | 0.691 | 0.616 | | 0.923 | 0.696 |
| | | 0.698 | 0.657 | 0.691 | 0.565 | | 0.898 | 0.701 |
Note–All arrow indicators showed good trends. ARI, adjusted Rand index, measure of similarity between two clusters, involves random labeling independent of the number of clusters; DD, difficult to distinguish; ED, easy to distinguish; FMI, Fowlkes-Mallows score, geometric mean of pair-wise precision and recall; F-score, harmonic mean of precision and recall (values range from 0 [bad] to 1 [good]); NPE, neighborhood proportion error; V-measure, harmonic mean of homogeneity and completeness. All results reflect comparison of two dimensions, and the number of nearest neighbors (k) is 20.
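The V-measure in the note is the harmonic mean of homogeneity and completeness. The sketch below uses the usual entropy-based definitions of those two terms, which is an assumption since the note does not spell them out; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given), estimated from joint label counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())

def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    # Homogeneity: each cluster contains members of a single true class.
    homogeneity = 1.0 if h_true == 0 else 1 - conditional_entropy(true_labels, pred_labels) / h_true
    # Completeness: all members of a true class land in the same cluster.
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(pred_labels, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy example: the second true population is split across two clusters,
# so homogeneity is 1 but completeness drops to 2/3.
v_split = v_measure([0, 0, 1, 1], [0, 0, 1, 2])
v_perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
```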
Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data.
| CyTOF1 cell type | Coefficient before | Coefficient after | CyTOF2 cell type | Coefficient before | Coefficient after |
|---|---|---|---|---|---|
| CD11b-_Monocyte_cells | 0.6627 | 0.6657 | Basophils | 0.6094 | 0.613 |
| CD11bhi_Monocyte_cells | 0.7261 | 0.7261 | CD16-_NK_cells | 0.5474 | 0.5481 |
| CD11bmid_Monocyte_cells | 0.6666 | 0.6696 | CD16+_NK_cells | 0.6138 | 0.617 |
| CMP_cells | 0.4809 | 0.4864 | CD34+CD38+CD123-HSPC | 0.6346 | 0.6403 |
| Erythroblast_cells | 0.3733 | 0.3756 | CD34+CD38+CD123+HSPC | 0.6658 | 0.6992 |
| GMP_cells | 0.5715 | 0.5796 | CD34+CD38lo_HSCs | 0.5879 | 0.5942 |
| HSC_cells | 0.5544 | 0.5734 | CD4_T_cells | 0.6095 | 0.6096 |
| Immature_B_cells | 0.3899 | 0.3932 | CD8_T_cells | 0.6247 | 0.6249 |
| Mature_CD38lo_B_cells | 0.4863 | 0.4866 | Mature_B_cells | 0.6806 | 0.6806 |
| Mature_CD38mid_B_cells | 0.5594 | 0.5614 | Monocytes | 0.6925 | 0.6926 |
| Mature_CD4+_T_cells | 0.5155 | 0.517 | pDCs | 0.6511 | 0.6568 |
| Mature_CD8+_T_cells | 0.5916 | 0.5935 | Plasma_B_cells | 0.6055 | 0.6148 |
| Megakaryocyte_cells | 0.2805 | 0.2854 | Pre_B_cells | 0.6462 | 0.6475 |
| MEP_cells | 0.6374 | 0.6492 | Pro_B_cells | 0.6837 | 0.6914 |
| MPP_cells | 0.4966 | 0.5041 | |||
| Myelocyte_cells | 0.3919 | 0.3927 | |||
| Naive_CD4+_T_cells | 0.6915 | 0.6931 | |||
| Naive_CD8+_T_cells | 0.6891 | 0.6907 | |||
| NK_cells | 0.4645 | 0.4656 | |||
| Plasma_cell_cells | 0.4622 | 0.4638 | |||
| Plasmacytoid_DC_cells | 0.6214 | 0.6388 | |||
| Platelet_cells | 0.4867 | 0.5078 | |||
| Pre-B_I_cells | 0.559 | 0.5657 | |||
| Pre-B_II_cells | 0.5436 | 0.5456 | |||
| Cell-type homology | 0.537608 | 0.542942 | Cell-type homology | 0.632336 | 0.637857 |
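
The "Cell-type homology" summary values appear to be the arithmetic mean of the per-type coefficients. That reading is inferred from the numbers rather than stated in the table, and the short check below reproduces it for the 14 CyTOF2 cell types; the variable names are hypothetical.

```python
from statistics import mean

# CyTOF2 coefficients copied from the table, in row order (Basophils ... Pro_B_cells).
cytof2_before = [0.6094, 0.5474, 0.6138, 0.6346, 0.6658, 0.5879, 0.6095,
                 0.6247, 0.6806, 0.6925, 0.6511, 0.6055, 0.6462, 0.6837]
cytof2_after = [0.613, 0.5481, 0.617, 0.6403, 0.6992, 0.5942, 0.6096,
                0.6249, 0.6806, 0.6926, 0.6568, 0.6148, 0.6475, 0.6914]

homology_before = round(mean(cytof2_before), 6)
homology_after = round(mean(cytof2_after), 6)
improvement = homology_after - homology_before

print(homology_before, homology_after)  # 0.632336 0.637857
```

Both means match the reported CyTOF2 summary values, so the calibration feedback raises the average coefficient by roughly 0.0055.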