Literature DB >> 35404970

DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data.

Lijun Cheng¹, Pratik Karkhanis¹, Birkan Gokbag¹, Yueze Liu², Lang Li¹.

Abstract

Single-cell mass cytometry, also known as cytometry by time of flight (CyTOF) is a powerful high-throughput technology that allows analysis of up to 50 protein markers per cell for the quantification and classification of single cells. Traditional manual gating utilized to identify new cell populations has been inadequate, inefficient, unreliable, and difficult to use, and no algorithms to identify both calibration and new cell populations has been well established. A deep learning with graphic cluster (DGCyTOF) visualization is developed as a new integrated embedding visualization approach in identifying canonical and new cell types. The DGCyTOF combines deep-learning classification and hierarchical stable-clustering methods to sequentially build a tri-layer construct for known cell types and the identification of new cell types. First, deep classification learning is constructed to distinguish calibration cell populations from all cells by softmax classification assignment under a probability threshold, and graph embedding clustering is then used to identify new cell populations sequentially. In the middle of two-layer, cell labels are automatically adjusted between new and unknown cell populations via a feedback loop using an iteration calibration system to reduce the rate of error in the identification of cell types, and a 3-dimensional (3D) visualization platform is finally developed to display the cell clusters with all cell-population types annotated. Utilizing two benchmark CyTOF databases comprising up to 43 million cells, we compared accuracy and speed in the identification of cell types among DGCyTOF, DeepCyTOF, and other technologies including dimension reduction with clustering, including Principal Component Analysis (PCA), Factor Analysis (FA), Independent Component Analysis (ICA), Isometric Feature Mapping (Isomap), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) with k-means clustering and Gaussian mixture clustering. We observed the DGCyTOF represents a robust complete learning system with high accuracy, speed and visualization by eight measurement criteria. The DGCyTOF displayed F-scores of 0.9921 for CyTOF1 and 0.9992 for CyTOF2 datasets, whereas those scores were only 0.507 and 0.529 for the t-SNE+k-means; 0.565 and 0.59, for UMAP+ k-means. Comparison of DGCyTOF with t-SNE and UMAP visualization in accuracy demonstrated its approximately 35% superiority in predicting cell types. In addition, observation of cell-population distribution was more intuitive in the 3D visualization in DGCyTOF than t-SNE and UMAP visualization. The DGCyTOF model can automatically assign known labels to single cells with high accuracy using deep-learning classification assembling with traditional graph-clustering and dimension-reduction strategies. Guided by a calibration system, the model seeks optimal accuracy balance among calibration cell populations and unknown cell types, yielding a complete and robust learning system that is highly accurate in the identification of cell populations compared to results using other methods in the analysis of single-cell CyTOF data. Application of the DGCyTOF method to identify cell populations could be extended to the analysis of single-cell RNASeq data and other omics data.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35404970 PMCID： PMC9060369 DOI： 10.1371/journal.pcbi.1008885

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Methods paper.

Introduction

The identification of different cell populations has become an essential focus in cell biology research, and mass cytometry and other such high-throughput technologies are rapidly developing to identify novel cell populations at the level of the individual cell [1-3]. Mass cytometry, also known as cytometry by time of flight, or CyTOF, is a variation of flow cytometry in which protein antibodies are labeled with heavy metal ion tags [4]. Capable of measuring up to 50 proteins in a single cell [5,6] and screening an average of approximately 120 million cells in each experiment, CyTOF serves as an important source of big data and powerful tool for the study of cellular diversity and dynamics. Many recent studies highlight its utility in enabling novel discoveries and enhancing the understanding of cell-to-cell interactions in multiple domains of immunology [7,8] concerning cell-subset heterogeneity and tissue localization [9,10]. Indeed, the characterization of cells by CyTOF improves our understanding of disease progression and drug-response sensitivity or resistance [8]. Many single-cell experiments focus on the identification of the types of cells present in a sample from single-cell CyTOF data, and conventional cell-type identification in CyTOF involves the sequential manual partition, or “gating,” of cells into subpopulations. Termed canonical cell population identification, gating requires the visual inspection of scatter plots by one or two protein biomarkers at a time [11-13] using tools such as FlowJo or FlowCore (Yale Flow Cytometry vendor) [14]. Gating is an important step in assigning individual cells into discrete cell types [15], but it is very time-consuming, technically inefficient, and prone to human error. Gating with 50 markers, for example, might yield about 250 = 1015 cell types [16], so manual gating will not allow an exhaustive search of all cell populations [17]. Furthermore, the high dimensionality and large proportion of none-labeled CyTOF data pose considerable challenges to the identification of cell populations [18,19]. New identification tools are urgently needed that will automate the analysis of data and permit the fulfillment of CyTOF’s potential for biological discovery and translational applications. Many clustering tools can perform this task, which is essential to identify “new” cell populations in explorative experiments. However, relying on clustering is laborious since it often involves manual annotation, which significantly limits the reproducibility of identifying cell-populations across different samples. Many clustering tools can identify cell populations in exploratory experiments [19]. For instance, the k-means clustering algorithm iteratively assigns data points (cells) [20] to the nearest centroids (cluster center) and recomputes the centroids based on a predefined number of clusters, and Gaussian Mixture Models (GMMs) [21] tend to group together data points belonging to a single distribution by assuming a certain number of Gaussian distributions, each of which represents a cluster. However, clusters obtained by these algorithms might not be robust. Such algorithms require non-intuitive parameters [22], like the number k of clusters. Moreover, for some applications, these algorithms are node-density strategies that might be insufficient to discover the clusters best representing the underlying data structures, such as the node and node relationship of hierarchical structure in a tree or graph [23]. In particular, one large dataset might distribute a large number of very dense objects in some areas and only a few objects in others. Hierarchical density-based spatial clustering of applications with noise clustering (HDBSCAN) [24,25] allows the analysis of datasets comprising millions of cells and provides aggregate information on generated hierarchical tree clusters, but information regarding local data structure, that is, single-cell resolution, is sacrificed. Graph-density clustering provides a promising novel strategy for cell-population identification, but the clustering often involves manual annotation, again, a time-intensive and laborious process, and this significantly impedes the reproducibility of cell-population identification across different samples [23]. The learning processes in these clustering methods for the automatic identification of cell populations must depend on prior biological knowledge about the populations to identify canonical cell populations [5]. The HDBSCAN clustering method can find arbitrarily-shaped clusters and maintain the topology of the structure of the data and does not require a priori specification of the number k of clusters in the data, as is the case with k-means. Clustering and dimension-reduction technologies have often been combined to visualize data in two dimensions and thereby enhance interpretive capabilities in the analysis of mass-cytometry data [19,26,27]. Such dimension-reduction techniques include Principal Component Analysis (PCA) [28], Factor Analysis (FA) [29], Independent Component Analysis (ICA) [30], Isometric Feature Mapping (Isomap) [31], t-distributed Stochastic Neighbor Embedding (t-SNE) [32], and Uniform Manifold Approximation and Projection (UMAP) [33,34]. In particular, the latest t-SNE and UMAP methods are the two most widely used graph-embedding techniques for visualization. They involve the construction of a high-dimensional graph representation of data followed by optimization of a low-dimensional graph to be as structurally similar as possible. Both algorithms excel at revealing local structure in high-dimensional data. With lots of clustering and dimension-reduction technologies development, how to select appropriated clustering and dimension reduction in the identification of cell types in CyTOF data remains under investigation. The recent development of deep learning (DL) as a powerful machine-learning method was inspired by the mechanisms of artificial neural networks in the brain, especially those underlying the recognition of patterns from images [35] and the natural processing of language [36]. The deep-learning graphic-clustering approach, DeepCyTOF, shows promise in identification known cell types from massive volumes of CyTOF data by automated labeling technology [37]. DeepCyTOF cell-type clusters are trained using samples labeled with marker genes, assigning target cells to canonical cell types with 99% accuracy compared with the use of real labels in CyTOF data analysis. However, this method cannot detect new cell populations beyond cell types defined by the reference samples and visualization, which is different to DGCyTOF. DGCyTOF has developed as a new integrated embedding visualization approach in response to the performance of DeepCyTOF in identifying canonical cell types and reducing dimensionality with clustering for visualization in the prediction of new cell types. DGCyTOF combines deep-learning classification and hierarchical stable-clustering methods to sequentially build a tri-layer construct for known cell types, the identification of new cell types, and visualization. At the same time, this technique preserves the local structure among single cells by a detailing of non-linear dimensionality-reduction-based method in 3-dimensional (3D) visualization. DGCyTOF first utilizes marker-labeled samples to calibrate cells into different classifications and then applies dimension-reduction and hierarchical clustering methods to assign the unlabeled samples into the appropriate classifications and calibrate their labels. New cells are identified by the integration of UMAP and HDBSCAN hierarchical tree representation of the complete data, which preserves the non-linear high-dimensional relationships between cells in low-dimensional space. The integration framework in DGCyTOF allows interactive exploration of the hierarchy by a set of embeddings, 3D scatter plots in which cells are positioned based on the similarity of all marker expressions and used for subsequent analysis, such as the clustering of cells at different levels of the hierarchy. Utilizing two benchmark CyTOF databases comprising up to 43 million cells, we compared accuracy and speed in the precise identification of cell types among DGCyTOF, DeepCyTOF, and other technologies including dimension reduction with clustering, we observed the superior performance of DGCyTOF by eight measurement criteria. In addition, we identified previously missed rare cell populations specifically associated with diseases in both the innate and adaptive immune compartments.

Materials and methods

Datasets

delineates two CyTOF benchmark data sets, CyTOF1 and CyTOF2 from healthy human bone marrow mononuclear cells (BMMCs) [38,39]. All proteins markers do not have overlapping in the two datasets. (1) The first, CyTOF1, generated by [38] from one healthy volunteer, comprised data of approximately 167,000 cells with 13 markers) about half of which had been manually gated into 24 cell populations that included protein-expression levels from healthy human BMMCs. (2) The second set, CyTOF2, generated by [39] from two healthy human donors, consisted of almost 266,000 cells with 32 protein-expression markers, about 39% (104,184) of which had been manually gated into 14 major immune cell populations, with the remaining 61% (161,443) labeled as "unassigned."

Methods

Overview of DGCyTOF: Deep learning with graphic clustering in calibration-feedback learning for the analysis of CyTOF data

DGCyTOF combines deep-learning classification, graphic clustering, and dimension reduction in a sequential process to automate the classification of canonical cell populations and thereby overcome many limitations associated with traditional methods of cell-type identification and augment the discovery of novel populations from mass-cytometry data (). The deep-learning model () is used to predict cell-type labels of a new dataset based on a reference dataset, in which each of the cells had been labeled according to its canonical cell type. The UMAP and HDBSCAN is used in clustering for new cell type identification by force-directed graph algorithms involving spring-like attractive forces and electrical repulsions between nodes connected by edges on hierarchy clusters that reveal single “outlier” cells in the left small set of CyTOF datasets after canonical cell identification. In the middle layer of classification and clustering, a calibration-feedback system is used to adjust the identification of cell types to reduce false-negative errors (Type II errors) between classification and clustering, and projection of labels in 3D space provides their vivid visualization depiction for easy differentiation of types of cell annotations (). The whole process includes the following five steps:

A framework of DGCyTOF model in the identification of canonical cell population and new cell type populations.

A) The flowchart of DGCyTOF. To single cell data, it includes labeled and unlabeled data in CyTOF database. Identification of cell types includes four processes. (1) To cells labeled, a supervised deep learning automatically identifies canonical cell populations or cell types gated by protein markers, the detailed description sees (B). (2) To new cell population, a novel graphic-clustering integrating UMAP + HDBSCAN allows a learning of feature representations and preservation of data structure in a network of cell-to-cell interaction for the assignment of clusters for identification of new cell populations. (3) These cell types from classification and clustering are adjusted between (1) and (2) layers above mentioned via a feedback-loop using an iteration calibration system to reduce false-negative errors in the system integrating cell identification. (4) In the final step, a tool permitting three-dimensional (3D) visualization is developed to display the cell clusters, projecting all cell type labels into independent 3D space for their vivid depiction and differentiation to facilitate the identification of cell types. B) A three-layer artificial neural network constructs the deep classification-learning model for the identification of canonical cell populations. C) A calibration-feedback learning system for cell type correction. After deep learning model in Fig 1A, there are lots of known cell types identified (here called existing cluster). A correlation threshold value averaging the Spearman correlation determines whether the cell belongs to these known cell population. If correlation of the filtered cell with cells from the given canonical cell population is greater than the correlation threshold in this population, we reallocate that cell to this canonical population. Step 1: Predict canonical cell type using a deep neural network model: softmax classifier. A three-layer artificial neural network constructs the deep classification-learning model for the identification of canonical cell populations ( Given a labeled dataset , where x∈R, x are the protein markers, y is the labeled cell population, and y∈{1,…,K}, K is the number of manually gated cell populations. In each cell, the softmax classifier [40] (multinomial logistic regression) estimates the probabilities of that cell belonging to K different cell populations, and the true population is that with the highest probability (). The likelihood of the cell belonging to a particular population is given by hΘ(X), which takes the form: where θ( is the parameter sets on the output layer, s, is the weight parameter, and b( is a bias constant. The weight parameter, w(, and bias, b(, are learned through training on labeled samples and minimizing the loss function, L(w,λ) which is the binary cross-entropy between the observed labels, Y, and predictions, as function (2) where λ is the regularization strength (hyperparameter) to penalize "large" weight w coefficients. To output layer, s, the i node (i−classification) logistic regression link function in L can be written: The loss function, L, is the sum over the difference between the observed labels, Y, and predictions, , changing Function (2) as follows: Adaptive moment (ADAM) optimization algorithm [41] is used to minimize the loss function. The half-half samples labeled is the training data and the test data. After deep learning model is trained, these unlabeled data will input to the trained model to predict their cell-type labels. We thus obtain a prediction probability matrix that denotes the probability of each cell belonging to different populations. Rows in the matrix represent populations, K, and columns are cells. To classify cell types, we trained depth-N feed-forward neural nets, each consisting of three hidden layers and a output layer. To determine the optimal number of layers, penalization weight (λ) and their drop-out rate in the neural network, DGCyTOF utilized an automated pruning way of building neural network to improve computational and classification performance. By using particle swarm-optimized-based pruning algorithm [42], we selected the optimal multiple parameters (the number of layers, lambda, drop-out rate) in Python during network training. This objective maximization is to keep with the classify accuracy solution that maximizes the area under the Receiver operating characteristic (ROC) curve between real labels and prediction labels. Step 2: Filtering cells to further detect cell populations. To keep homogeneous cells in a cluster, an optimal threshold value, Th, is selected to filter out cells that do not fit well into the cluster. A histogram provides an accurate representation of the probability distribution in each cell population. A cell with a confidence threshold Th value below 5% is removed from the labeled cluster from Step1, labeled “unknown,” and assigned to a new population for future re-labelling as Eq (5) [5]. By the end of Step 2, about 5% of cells will belong to no particular cell population. Step 3: Feedback calibration system to reduce misclassification error. Despite the strong classification ability of the deep-learning model in Step 1, some cells will be misclassified into incorrect cell-type populations or filtered out of their appropriate clusters, yielding potentially 10 to 20% false negatives after Steps 1 and 2. It is highly desirable to construct calibration feedback and reclassify cells to reduce the error and clarify the homology of cell types within each cluster. Feedback calibration learning system is designed to calibrate the cell-to-cell correlation of inner cell types and improve the homology of cell types within clusters (). A Spearman correlation threshold is applied to reallocate these “outlier” points to seek associated clusters [43]. For each canonical cell population, we first calculate the average Spearman correlation of each filtered cell, , with all cells classified into that particular cell population. Let such correlations with the cells from the cell population y be r1, r2,…,r, where q is the number of filtered cells, K is the number of canonical cell populations, and j = 1,…,K. Let r be the Spearman correlation threshold of cells from the cell population j. The threshold r is the average Spearman correlation of cells in the cell population j. If r>r, then cell ∈ cell-type j. In other words, if correlation of the filtered cell with cells from the given canonical cell population is greater than the correlation threshold in this population, we reallocate that cell to this canonical population. If the filtered cell demonstrates correlation greater than the correlation threshold with one or more other canonical cell populations, we reallocate the cell to the population with which it shows highest correlation. Step 4: Embedding hierarchical tree clustering to detect new cell populations. Graphic clustering requires a transformation of features from high- to low-dimensional feature space by UMAP (uniform manifold approximation and projection for dimension reduction) [33]. The clustering then adopts an explicit approximation of maximum likelihood to estimate the diversity of distribution between the latent representations in the low-dimensional space, at the same time preserving the structure of data in the network of cell-to-cell interactions for the assignment of clusters utilizing HDBSCAN [24]. Dimension reduction by UMAP. UMAP [33] is a novel manifold embedding learning technique for general non-linear dimension reduction. Let denote cells filtering out of canonical cell populations, where x∈R. UMAP works on a weighted graph, denote G = {V, E, w} with vertices, edge E with weight w. For each cell, UMAP first finds its nearest k neighbors by their connection weights, , in the high dimension. The weights w(i, j) depend only on the points in the neighborhoods of x and x with weights w and w; that is, the weights depend on no more than (2k+1) neighbor points. The weight, , is computed from high-dimensional edges for selected neighborhood cells. UMAP employs a manifold learning technique to map the connections of each cell and its neighbors to a low-dimensional graph with weight w′, which maintains the topology of the global structure of the data and distance between cells in the low-dimensional space [33]. Let σ be the diameter of the neighborhood of x, ρ be the distance from x to its nearest neighbor, and A be the weighted adjacency asymmetric matrix of G. We can then make A symmetric by letting: where , d is the measure of dissimilarity and ρ ensures the local connectivity of the manifold. Given two weights, w, w′ in the dataset, the cross-entropy C(w, w′) between them is: where w represents the weights computed from high-dimensional filtered cells and w′, the weights computed from low-dimensional embedding. UMAP optimizes lower-dimensional embedding with respect to cross-entropy by stochastic gradient descent. Structure hierarchical tree clustering by HDBSCAN. A density-based clustering algorithm, HDBSCAN constructs a hierarchical tree of clusters and applies a specific stability measure to extract flat clusters from the tree, providing for the discovery of all the small clusters and revealing obvious outliers or noise [24,44-46]. The HDBSCAN algorithm, detailed by McInnes and colleagues [25], can be abstracted into the following steps [45]. 1) It first finds the points in the epsilon-neighborhood of every point and applies a mutual reachability distance metric [47] to identify the core points with neighbors of more than “minimum cluster size” (denoted by minPts): where d(a, b) is the distance between cells a and b according to the chosen metric, e.g., Euclidean distance, and core(x) is a core distance [45,47]. 2) The algorithm allows representation of the dataset as a graph with data objects as vertices connected by weighted edges with the mutual reachability distances as weights, and the graph is employed to construct a minimum spanning tree (MST). The algorithm then finds the connected components of core points on the neighbor graph, ignoring all non-core points by condensed cluster hierarchy. Sorting the core point edges by mutual reachability distance results in a hierarchical tree structure (dendrogram). 3) By choosing an optimal threshold, epsilon, as a global horizontal cut value and selecting all clusters with at least minPts points at this density level, we can retrieve the HDBSCAN clusters for this epsilon from the hierarchy. HDBSCAN’s selection algorithm traverses the cluster tree from the bottom up, comparing the stability value of each node to the sum of the stability values of its nested subclusters, thereby propagating and updating stabilities as it ascends the tree until the cluster with the highest stability is found and selected on each tree branch. First, the degree of density of each node is defined by its epsilon, or node stability, value. The resulting clusters more closely approximate the hierarchy of the level sets of the true density distribution of the cells [48]. We employed the HDBSCAN algorithm to discovers clusters of variable densities within node groups with high internal-edge density and the clusters most representative of the underlying structure of data in the low-dimensional space. During calibration and dimension reduction, Seurat MultiCCA (Canonical Correlation Analysis) method [49] is used to remove the batch effect prior to use DGCyTOF, which was developed in 2017 by the Satija lab. By CCA, the high-throughput data is projected into a subspace to identify correlations across datasets to reduce data dimensionality and capture the most correlated data features to align the data batches. Step 5: Visualizing cells and cell populations in 3-dimensional space. For visualization, we first used UMAP to project the protein-expression matrix of CyTOF into three dimensions. We added a new label axis, Z, to denote cell types previously annotated in Steps 1 and 4. Projecting the similarity, S, and associated labels into 3D space enabled the intuitive depiction of hidden structures in the data.

Algorithm implementation

Implementation of deep learning: The neural network model is implemented in Python using the PyTorch framework [50]. A feedforward neural network comprises three hidden layers, one containing 128 nodes, another, 64 nodes, and the last, 32 nodes (), and each layer has a rectified linear unit (ReLU) activation function that maps the input vector to non-linear output for the next layer. The output is the softmax layer and has the same number of output nodes as the number of cell types. Softmax assigns decimal probabilities to each class in our multi-class problem. We used the cross-entropy loss function, optimized using adaptive moment estimation [41], with a learning rate of 0.001 and batch size of 256. Other methods and their implementation: All computation algorithms and their comparison were performed in Python with Scikit-learn [51], a library featuring a wide range of machine-learning algorithms. UMAP dimension reduction and HDBSCAN clustering were implemented using their available software [25,33]. The Python system was integrated into the R-Shiny application for visualization. The code and datasets used in this study are available at .

Methodology evaluation measure

1. Evaluation of dimension-reduction performance

Three metrics computation time, neighborhood proportion error (NPE), and residual variance (RV) are used to evaluate performance of dimension-reduction methods–principal component analysis (PCA), factor analysis (FA), independent component analysis (ICA), isometric feature mapping (Isomap), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). (1.1) Computation times All methods were executed on the NVIDIA Tesla P100 GPU cluster provided by the Ohio Supercomputer Center (OSC) to measure time and speed. (1.2) Neighborhood proportion error NPE is used to measure total variation distance between the probabilities of cells’ assignment to the same subtype in original data and embedding [52]. A smaller NPE indicates better “homology” within a cell type. NPE is defined in Formula (9). where, where P and Q represent the empirical density distributions of subtype s in the original data and embedding that in low-dimensional space, respectively. The variable a represents the fraction of k neighbors that belong to the same subtype s as the cell in both the original space P and low-dimensional embedding Q. δ(P, Q) is the total variation distance [53] for each subtype, s∈S, where S is the set of all manually gated subtypes. (1.3) Residual variance Retained variance is a criterion that can be used to choose the appropriate number of principle components to an embedding system in a low dimension space. It represents how much of the information was retained after dimension reduction, such as after PCA. It can be evaluated using residual variance (RV) by measuring preserved pair-wise distances [31]. In Eq (10), D is each method’s best estimate of intrinsic manifold distances from high dimension space; and D is the Euclidean distance matrix in the low-dimensional embedding. R is the standard linear correlation coefficient over all entries of D and D. The lower is retained variance, the better of algorithm performance. When no correlation of distances, residual variance attains its maximum value of 1.

2. Cluster evaluation metrics

Two metrics are introduced to evaluate the clustering results: adjusted random index (ARI) [54] and V-measure [55]. (2.1) Adjusted Rand index (ARI) The ARI metric evaluates whether dimension-reduced similarity cluster results are similar to one other. which is defined by Formula (11) [54] Here, N is the number of data points in a given data set and N is the number of data points of the class label assigned to cluster C in partition P*. N is the number of data points in cluster C of partition P, and N is the number of data points in class . In general, an ARI value lies between 0 and 1. The index value is equal to 1 only if a partition is completely identical to the intrinsic structure and close to 0 for a random partition. The advantage of the ARI is that it makes no assumption about the cluster structure. Therefore, it can be used to compare clustering algorithms, such as k-means and spectral-clustering algorithms. (2.2) V-measure The V-measure is the harmonic mean between homogeneity and completeness as evaluated using a conditional entropy-based external measure of clustering [55]. Homogeneity requires that all clusters contain only cells that belong to a single population, and completeness, that all cells belonging to a given population are elements of the same cluster. Similar to the ARI, the V-measure makes no assumption about the cluster structure.

3. Evaluation of classification performance

(3.1) F-score We evaluated our methods using the balanced F-score (F) as described by Aghaeepour’s group [11]. The F-score for multiple classes is defined as the weighted average of the F-score for each cell type: where C is the number of cells with type i, N is the number of cells, and F is the F-score for the ith cell type versus all other types (including unknown types): where recall denotes how many relevant items are selected, and precision denotes how many selected items are relevant. The four outcomes can be formulated in a 2×2 contingency : An F-score of 1.0 indicates perfect agreement with the labels obtained by the clustering or classification method. (3.2) Receiver operating characteristic (ROC) The ROC curve is used as a measure of the quality of classification consistency between the true and prediction labels. The curve is created by plotting the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold settings, where TPR = TR/(TP+FN), and FPR = FP/(FP+TN), TP, TN, FN and FP see contingency . The area of the curve shows the ROC accuracy. (3.3) Fowlkes-Mallows score (FMI) The FMI is an evaluation metric to evaluate the similarity among clusters obtained after applying different clustering algorithms [55]. The FMI is defined as the geometric mean of pair-wise precision and recall as formular (14). Where TP, TN, FN and FP see contingency With a random classification, the FMI will approach zero. A perfect classification will result in an FMI of 1.

Results

Identification of cell types by DGCyTOF in two datasets

We used DGCyTOF to automatically identify known and new cell populations in two benchmark CyTOF datasets rresulting in F-scores of 0.9921 (CyTOF1) and 0.9992 (CyTOF2) for the identification of labeled cells and cell populations (. shows the results of graphic clustering, in which unlabeled cells not belonging to known populations were split into five clusters in the CyTOF1 dataset, and shows six clusters of new cell populations among those unknown cells in CyTOF2. depict the final cell populations, including new subtypes and calibration (known) cell types, in 3-dimensional space utilizing the DGCyTOF platform.

Fig 2

Cell population identification by DGCyTOF in the analysis of CyTOF1 and CyTOF2 datasets.

Fig 2A identifies the 32 types of known cells by deep classification learning for dataset CyTOF1, and Fig 2C, the 13 types of known cells for CyTOF2. Fig 2B and 2D show the spectral clustering for the identification and visualization of unknown cell populations in the two datasets.

Cell population identification by DGCyTOF in the analysis of CyTOF1 and CyTOF2 datasets.

Evaluation and comparison of methods

External validation of the DGCyTOF model in the identification of known cell types

External validation is used to evaluate the overfitting phenomenon in the deep-learning algorithm. We randomly separated all labeled samples equally into a training set and a validation set for both CyTOF1 and CyTOF2. We used weight regularization to reduce overfitting of DL model. DGCyTOF model’s hyperparameter tuning was done utilizing a regularization l1 normalization to penalize weights sparse to 0 and the top 5% parameters is selected [56]. On the other hand, we designed the dropout and dense strategy for parameters selection to prevent neural networks from overfitting [57,58]. DeepCyTOF selected parameters automatically by its own design. The confusion matrix allows us to visualize the performance of the supervised algorithm. Each row of the matrix represents the instances in an actual class, and each column represents the instances in a predicted class (or vice versa). For example, our algorithm correctly assigned 12057 of 12065 cells in our validation data as CD8-T cells and the other eight incorrectly, as CD4-T, CD16-NK, and Plasma-B cells (). We applied a ROC metric to evaluate the quality of classifier output using 4-fold cross-validation between the true and prediction labels (). Results showed the average performance in the two testing datasets, with 71.9 to 99.4% accuracy and average ROC area of 0.94 to 1.00 in CyTOF1 and 96.5 to 99.8% accuracy and average ROC area of 1.00 in CyTOF2.We compared all prediction results for combination methods with true labels, employing four evaluation criteria to compare the performance of the DGCyTOF and DeepCyTOF methods [37] in classifying labeled data in the CyTOF1 and CyTOF2 databases: F- score (harmonic mean of precision and recall) [11]; ARI (adjusted Rand index, a measure of similarity between two clusters) [59]; FMI (Fowlkes-Mallows scores) [55]; and V-measure (harmonic mean between homogeneity and completeness) [51]. shows the comparison results. All cell-type classifiers used in the DeepCyTOF model were depth-4 feedforward nets, with softplus hidden units and a softmax output layer in which hidden layer sizes were set to 12, 6, and 3. The DGCyTOF parameter setting is detailed in the implementation section.

Comparison of confusion matrices and their associated receiver operating characteristic (ROC) curves for real labels of CyTOF1 and CyTOF2 datasets as assessed utilizing the DGCyTOF model.

(A,B) confusion matrices for the (A) CyTOF1 dataset and (B) CyTOF 2 dataset; (C,D) ROC curves for the (C) CyTOF1 dataset and (D) CyTOF 2 dataset. Note: All arrow indicators showed good trends

Evaluation of embedding-clustering performance

To measure the effectiveness of dimension-reduction techniques that preserve cell proximities, such as cell types, we compared the performances of six techniques with three clustering technologies with respect to computation time, neighborhood proportion error (NPE) [, residual variance (RV), and ability to cluster known cell types in two manually gated benchmark mass-cytometry datasets (CyTOF1 and CyTOF2). We compared these variables among the six techniques–principal component analysis (PCA) [28], factor analysis (FA) [29], independent component analysis (ICA) [30], isometric feature mapping (Isomap) [31], t-distributed stochastic neighbor embedding (t-SNE) [32] and uniform manifold approximation and projection (UMAP) [33]–adding the clustering K-means, Gaussian mixture model, and agglomerative clustering algorithm HDBSCAN by analysis data in CyTOF1 (13 biomarkers, 24 labeled cell types and CyTOF2 (32 biomarkers, 14 labeled cell types). We compared all prediction results of the combination methods with respect to the identification of true labels, whereby the NPE defined the fraction of cells belonging to a specific subtype under a fixed point, such as the nearest neighbor (k), k = 20. To evaluate clustering speed and accuracy, we assessed the F-score [11], ARI [59], FMI, and V-measure [51,55]. Tables show a comprehensive comparison of machine-learning methods by dimension-reduction methods (linear and nonlinear) + clustering methods in our two high-throughput mass-cytometry datasets, in which parts of cell types have been labeled manually. To be comparable, all dimension-reduction methods reduce the dimensionality of a multivariate data to two principal components (2-dimensional embeddings), that can be visualized graphically, with minimal loss of information, here. Note–All arrow indicators showed good trends. ARI, adjusted Rand index, measure of similarity between two clusters, involves random labeling independent of the number of clusters; DD, difficult to distinguish; ED, easy to distinguish; FMI, Fowlkes-Mallows score, geometric mean of pair-wise precision and recall; F-score, harmonic mean of precision and recall (values range from 0 [bad] to 1 [good]); NPE, neighborhood proportion error; V-measure, harmonic mean of homogeneity and completeness. All results reflect comparison of two dimensions, and the number of nearest neighbors (k) is 20. Note–All arrow indicators showed good trends. ARI, adjusted Rand index, measure of similarity between two clusters, involves random labeling independent of the number of clusters; DD, difficult to distinguish; ED, easy to distinguish; FMI, Fowlkes-Mallows score, geometric mean of pair-wise precision and recall; F-score, harmonic mean of precision and recall (values range from 0 [bad] to 1 [good]); NPE, neighborhood proportion error; V-measure, harmonic mean of homogeneity and completeness. All results reflect comparison of two dimensions, and the number of nearest neighbors (k) is 20. In speed comparison, because naive t-SNE applications suffer such severe shortcomings as a limited number of cells for analysis and low speed [19,60]. In our experience, the larger the data set, the more severe this problem. Here, we decided to measure the run-time of each of three random sub-sampled datasets CyTOF1 and CyTOF2, each consisting of 10,000 cells, using the average of the three times for each technique as our final computation time comparison. UMAP and t-SNE are both non-linear graph-based methods and have become an extremely popular technique for visualizing high dimensional data. By these cells, our experiment displays the UMAP speed is averaging around 3–4 times faster than t-SNE, 18.806 versus 94.466 seconds for CyTOF1 and 16.944 versus 95.609 seconds for CyTOF2 (Tables ). Other dimension reduction methods PCA, FA, ICA and Isomap involved a singular value decomposition (SVD) based on matrix factorization decomposition. Their speed is much fast than other dimension reduction methods with 0.017, 3.618, 0.060, and 0.020 seconds for CyTOF1 and 0.0253, 0.028, 0.054 and 0.021seconds for CyTOF2 (Tables ). However, they do not fare well in clustering visualization and accuracy in classification. Accuracy, judged better by a smaller NPE and visualization (), were very similar between the two datasets for UMAP and t-SNE, with NPE variance of 0.462 versus 0.423 for CyTOF1 0.399 versus 0.393 for CyTOF2 0.393. We therefore conclude that graphic models are important technologies that fit well in the analysis of single-cell CyTOF data. The accuracy in identifying cell homology was much higher using the nonlinear structure methods, UMAP and t-SNE, than any of the other methods, PCA, FA, ICA, and Isomap.

Visualizations of cell populations in databases CyTOF1 and CyTOF2.

(A-B) Two-dimensional visualization of embedding of cells for the identification of dimension-reduction techniques in databases CyTOF1 (A) and CyTOF2 (B). Cell subtypes are labeled by DGCyTOF for databases CyTOF1 (C) and CyTOF2 (D). In accuracy comparison of dimensional reduction with clustering methods, our graphic clustering HDBSCAN+UMAP in DGCyTOF is an extension of the HDBSCAN clustering method in the low-dimension space of UMAP. The graphic clustering shows the highest accuracy in clustering accordance for real labels with F-scores of 0.924 (92.4%) for CyTOF1 data () and 0.981 (98.1%) for CyTOF2 data (). In contrast, other embedding clustering systems demonstrate average F-scores of 56.45% to 65% () and 58.8 to 80% (UMAP+Gaussian mixture model). Whereas the F-scores is nearly 0.352 on average to methods PCA, FA, ICA, Isomap with or without k-means clustering and Gaussian mixture clustering. That means any of the techniques used independently reaches 35.2% accuracy in the identification of cell types except to t-SNE. Although t-SNE+HDBSCAN improved the performance of clustering accuracy F-scores to 0.771 for CyTOF1 () and 0.923 for CyTOF2 (). To other metric measurements ARI, FMI and V-measure, we obtained the similar results and conclusions as F-score. These results illustrate that DGCyTOF use integration technology HDBSCAN+UMAP, which can improve the entire performance of cell identification in CyTOF data by approximately 35%. Overall, DGCyTOF, which integrates UMAP and HDBSCAN technology, outperforms two state-of-the-art algorithms, those of the t-SNE/UMAP+k-means method and t-SNE/UMAP+Gaussian mixture model. This can be attributed primarily to the requirement of the dimension-reduction method to make fewer assumptions about the nature of cell-to-cell relationships for any given analysis, thereby limiting the accuracy of traditional clustering methods, like the k-means method and Gaussian mixture model, for this type of data. In HDBSCAN, an agglomerative clustering algorithm organizes clusters of points into a minimal spanning tree (MST) rather than connecting clusters in a highly connected graph structure in which all connected clusters can create hierarchy relationships between clusters and/or cells in the population. This method sharply improves the accuracy of cell identification. allows detailed visualization of these phenomena. Another interesting result is that the performance of data dimensionality reduction+clustering is much better than clustering directly without dimension reduction. The result illustrates that dimensionality reduction contributes to main signal extraction during clustering or classification (Tables ).

Visualization of cell populations: Clustering and differentiation

In 2-dimensional embeddings, researchers are generally interested in the observation of phenotypic clusters and the ease to distinguish them. Phenotypic clustering involves the clustering of cells with the same manually gated subtypes to permit the identification of distinct populations with clear boundaries for known cell types and their differentiation. The visualizations in show the variation among the different dimension-reduction techniques with respect to their abilities to permit identification of phenotypic clusters and allow easy differentiation. Two-dimensional visualization and cell annotation of the 32-dimensional data of CyTOF 1 () and 12-dimensional data CyTOF 2 (). show the results for all 3D visualization utilizing our DGCyTOF tool. Different dimension-reduction methods combined with different clustering technologies display distinct clustering and observable differentiation patterns. Clusters by PCA, FA, or ICA do not demonstrate the clear separation of cell subtypes; significant mixing between the different populations makes this embedding ineffective for subtype classification. Neither do they clearly distinguish cluster boundaries along which we can observe the process of cell differentiation in CyTOF1 and CyTOF2 data. Clusters by Isomap show similarly extensive mixing of different cell subtypes, displaying very little distinct clustering and no observable differentiation patterns. Clusters by t-SNE, however, shows the clustering of cell subtypes so that they may be easily distinguished for the differentiation of cells. UMAP also demonstrates clearly defined clusters, with large gaps between some distinct groups, and DGCyTOF distinguishes cell subtypes for easy differentiation. Cell classification and clustering in 3D feature space enables clear and easy discernment of each subtype, including newly identified subtypes. From these observations, we conclude that our 3D visualization technique allows the accurate and intuitive observation of cell-population distribution and detection of cell-to-cell distribution. Visualization using the t-SNE method is superior to that of other methods in displaying different cell subtypes, but that in UMAP allows ease in distinguishing clusters that enhances differentiation.

Feedback-calibration simulation between known cell populations and unknown cells

Feedback-loop learning system is designed to calibrate cell types between inner known-cell type and intra-unknown cell population to improve the inner cell-type homology (). Spearman correlation is used to calculate cell-to-cell correlation within clusters, and a threshold value averaging the Spearman correlation determines whether a cell belongs to an unknown cell population. We classified cell-type homology within cells and utilized the feedback-loop learning after deep-learning classification to correct the cell types between known and unknown population. illustrates the correction process in CyTOF1 and CyTOF2 data analysis. shows the average Spearman correlation coefficient, r, before and after calibration learning and indicates the improvement of many of the classifications of inner cell-type homology (highlighted in bold) after calibration. We can see the cell type homology inner a cluster have an improvement although it is not much.

Conclusions

Recent advances in mass cytometry (CyTOF) have radically altered the fate of single-cell proteomics by allowing a more accurate understanding of complex biological systems and identify novel cellular subsets [4]. Mass cytometry allows analyses of cells in suspension such as blood but also extended for the analyses of tissue sections. New calculational technology needs in dealing with such a big data to characterize the complex cellular samples’ types, where rare cell populations with essential biological function would otherwise be missed. For the first time, we propose DGCyTOF method by integrating the advantages of both the classifier and clusters strategies in identification of known and new cell types according to relative protein abundances from cytometry data. The DGCyTOF method allows automatic and highly accurate assignment of known labels to single cells using deep learning, detects new cell populations utilizing a novel graphic-clustering technique, and employs the guidance of a calibration system to achieve an optimum balance of accuracy. Guided by a feedback calibration system, the model seeks optimal accuracy balance among calibration cell populations and unknown cell types, yielding a complete and robust learning system that is highly accurate in the identification of cell populations compared to results using other methods in the analysis of single-cell CyTOF data. Compared to other methods in the analysis of two single-cell CyTOF standard datasets, DGCyTOF represents a robust complete learning system with high accuracy and speed in cell-population identification by comparing with popular dimension-reduction techniques PCA, FA, ICA, Isomap, t-SNE and UMAP with k-means clustering and Gaussian mixture clustering technology that make minimal assumptions about the nature of relationships between the input cells. We used metric measurements computational speed, ARI, FMI, F-score, NPE, V-measure and visualization to assess the quality and utility of reduction in comparison. The DGCyTOF displayed a highly accurate assignment of labels in detecting cell populations. In particular, the DGCyTOF obtains the best performance in the running speed of different algorithm comparison. In addition, observation of cell-population distribution was more intuitive in the 3D visualization in DGCyTOF than t-SNE and UMAP visualization. We believe the novel DGCyTOF will place cells into functionally distinct groups and types and allow for detailed analyses of cellular heterogeneity not only for calibration cell types, but for new rare cell types. The DGCyTOF hold great potential to uncover the tissue and immune system’s cellular variation patterns and functionality by these inferring cell types. Application of the DGCyTOF method to identify cell populations could be extended to the analysis of single-cell RNASeq data and other omics data [4,61]. 10 May 2021 Dear Dr. Cheng, Thank you very much for submitting your manuscript "DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Kathryn Miller-Jensen, Ph.D. Associate Editor PLOS Computational Biology Jason Haugh Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In the manuscript titled 'DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single-cell mass cytometry data", the authors created a new pipeline to classify single cells from CyTOF data into known cell types. The pipeline also allows users to identify clusters that cannot fit into known cell types. While some ideas of the pipeline are novel, such as the calibration step to refine to cell populations, I have many concerns about the proposed method. 1. The authors showed the performance of the classification using two datasets. However, in each experiment, the reported performance is based on training and validation data spit from the same dataset. How will the model perform in the presence of batch effects? The whole purpose of deep-learning is to learn a generalizable model from one labeled dataset and apply the model to classify data from future experiments/studies. 2. An important use of automated clustering is to identify new cell subsets within known cell types. For example, there may be multiple cell subsets within CD8+ T cells, and their proportions could be associated with a disease. Why only limiting the cell clustering to cells with unknown label? 3. While UMAP performs the best when preserving the data's local structure, there will still be information loss during dimension reduction. It is not clear to me why the authors did not choose to perform clustering using original data. Can the author add another column in table 4 and 5 showing clustering performance without dimensionality reduction? 4. The pipeline includes several hyper-parameters, including the number of layers in the DL model, the number of nodes in each layer, the threshold for softmax probability, and the correlation threshold for calibration (not to mention the hyperparameters in HDBSCAN and UMAP). How are the hyperparameters chosen? Is the test dataset used for hyper-parameter selection? 5. The authors shared the package in Github. However, I did not see any documentation other than the installation guide. A concise vignette will be helpful for users to get started. 6. The authors mentioned that "However, relying on clustering is laborious since it often involves manual annotation, which significantly limits the reproducibility of identifying cell-populations across different samples." It is unclear how the proposed clustering method (UMAP + HDBSCAN) improves the reproducibility of identifying cell populations across different samples. Is the clustering stable if applied to individual samples? Reviewer #2: The authors presented a deep learning framework, termed DGCyTOF, for embedding visualization of single-cell datasets. Specifically, DGCyTOF combines deep learning classification and hierarchical stable-clustering methods to sequentially build a tri-layer construct for known cell types and the identification of new cell types. Based on systematic evaluation, the authors showed a higher performance of the proposed DGCyTOF compared to Principal Component Analysis (PCA), Factor Analysis (FA), Independent Component Analysis (ICA), Isometric Feature Mapping (Isomap), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) with k-means clustering and Gaussian mixture clustering. Overall, this is a novel study, which provides a powerful deep learning approach for single-cell data visualization. The manuscript is well-written and the overall computational evaluation is solid. Several specific minor comments should be considered further. One major concern is hyperparameter tuning for DGCyTOF and other comparing approaches, including PCA, UMAP, t-SNE, and others. The authors are suggested to provide the detailed process for hyperparameter tuning. As shown figure 3, the authors illustrated AUC =1 for most cell types. The authors are suggested to discuss possible risk of over-fitting. The authors are suggested to discuss future studies about how to improve biological interpretation of deep learning models. Finally, computational cost and complexity of DGCyTOF should be compared with other approaches as well, such as PCA, UMAP, t-SNE, and others. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols 25 Aug 2021 Submitted filename: PCOMPBIOL-D-21-00432_comments_response.docx Click here for additional data file. 28 Sep 2021 Dear Dr. Li, Thank you very much for submitting your manuscript "DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Kathryn Miller-Jensen, Ph.D. Associate Editor PLOS Computational Biology Jason Haugh Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In the revised version of the manuscript titled "DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single-cell mass cytometry data", the authors have addressed most of my concerns by adding new evaluation results and more detailed descriptions of the methods. However, two of my questions remain to be answered. First, how does DGCyTOF deal with the batch effect? In the point-by-point response, the author has mentioned using CCA (Canonical Correlation Analysis) to adjust for batch effect. However, I can't seem to find the description of CCA in the manuscript. Second, it is still unclear to me how DGCyTOF is optimized. For example, how does the author determine the number of layers in the neural network and the number of nodes in each layer? The authors also mentioned using l1 regularization and drop-outs. How were the penalization weight (lambda) and the drop-out rates determined? Is there a cross-validation procedure for model optimization? Reviewer #2: The reviewer appreciated the extensive revision and the authors have addressed all concerns. The reviewer support its publication at PLoS Computational Biology. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. 13 Dec 2021 Submitted filename: PCOMPBIOL-D-21-00432R1_comments_response.docx Click here for additional data file. 16 Dec 2021 Dear Dr. Li, We are pleased to inform you that your manuscript 'DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Kathryn Miller-Jensen, Ph.D. Associate Editor PLOS Computational Biology Jason Haugh Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have answered all my remaining questions. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No 29 Mar 2022 PCOMPBIOL-D-21-00432R2 DGCyTOF: deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data Dear Dr Li, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Anita Estes PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Table 1

Two CyTOF benchmark data sets for analysis.

Database	No. of Cells	No. of markers	No. of manually gated populations	No. of manually gated cells (label data)
CyTOF1	167,004	13	24	81,747
CyTOF2	265,627	32	14	104,184

Table 2

Contingency table for calculating the receiver operating characteristic curve.

Total population	Condition positive	Condition negative	Prevalence
Predicted condition positive	TP	FP	precision = TR/(TP+FP)
Predicted condition negative	FN	TN	precision = TR/(TP+FP)
	recall = TR/(TP+FN)	specificity = TN/(FP+TN)

Table 3

Comparison of methods for averaging performance in the identification of known cell types in training and testing data by different measurements for CyTOF1 and CyTOF2 datasets.

Measurement	DGCyTOF		DeepCyTOF
Measurement	CyTOF1	CyTOF2	CyTOF1	CyTOF2
F–score ↑	0.9921	0.9992	0.9925	0.999
ARI ↑	0.9924	0.9991	0.992	0.9981
FMI ↑	0.9932	0.9993	0.993	0.9992
V-measure ↑	0.9822	0.987	0.9931	0.986

Note: All arrow indicators showed good trends

Table 4

Comparison of machine-learning methods by different measurements for CyTOF Dataset 1 (13 biomarkers, 24 labeled cell types).

Methods	Measurement	PCA	FA	ICA	Isomap	UMAP—DGCyTOF	t-SNE	NO-reduction
Dimension- reduction method	Computation time (in seconds) ↓	0.017	3.618	0.060	0.020	18.806	94.466
	NPE ↓	0.626	0.633	0.625	0.626	0.462	0.423
	Retained variance ↓	40.6%	33.4%	-	40.61%	N/A	N/A
	Visualization ↑	DD	DD	DD	DD	DD	ED
k-means clustering	F-score ↑	0.286	0.282	0.288	0.269	0.565	0.507	0.286
	ARI ↑	0.236	0.228	0.236	0.216	0.556	0.483	0.235
	FMI ↑	0.307	0.298	0.307	0.286	0.627	0.563	0.306
	V-measure ↑	0.494	0.488	0.495	0.477	0.793	0.762	0.494
Gaussian mixture model	F-score ↑	0.305	0.304	0.505	0.285	0.588	0.502	0.530
	ARI ↑	0.247	0.258	0.436	0.229	0.538	0.493	0.494
	FMI ↑	0.317	0.326	0.544	0.298	0.608	0.573	0.556
	V-measure ↑	0.497	0.490	0.586	0.481	0.790	0.768	0.704
HDBSCAN—DGCyTOF	F-score ↑	0.442	0.451	0.442	0.438	0.924	0.771	0.596
	ARI ↑	0.336	0.349	0.337	0.332	0.915	0.738	0.534
	FMI ↑	0.526	0.531	0.524	0.524	0.925	0.789	0.621
	V-measure ↑	0.557	0.570	0.552	0.555	0.905	0.850	0.557

Note–All arrow indicators showed good trends. ARI, adjusted Rand index, measure of similarity between two clusters, involves random labeling independent of the number of clusters; DD, difficult to distinguish; ED, easy to distinguish; FMI, Fowlkes-Mallows score, geometric mean of pair-wise precision and recall; F-score, harmonic mean of precision and recall (values range from 0 [bad] to 1 [good]); NPE, neighborhood proportion error; V-measure, harmonic mean of homogeneity and completeness. All results reflect comparison of two dimensions, and the number of nearest neighbors (k) is 20.

Table 5

Comparison of machine-learning methods by different measurements for CyTOF Dataset 2 (32 biomarkers, 14 labeled cell types).

Methods	Measurement	PCA	FA	ICA	Isomap	UMAP—DGCyTOF	t-SNE	NO-reduction
Dimension- reduction method	Computation time (in seconds) ↓	0.0253	0.208	0.054	0.021	16.944	95.609
	NPE ↓	0.536	0.525	0.535	0.536	0.399	0.393
	Retained variance ↓	31.03%	27.70%	-	31.03%	N/A	N/A
	Visualization ↑	DD	DD	DD	DD	ED	ED
K-means clustering	F-score ↑	0.426	0.421	0.431	0.322	0.590	0.529	0.458
	ARI ↑	0.343	0.330	0.349	0.232	0.540	0.475	0.409
	FMI ↑	0.444	0.431	0.448	0.340	0.637	0.578	0.537
	V-measure ↑	0.609	0.582	0.611	0.444	0.799	0.744	0.735
Gaussian mixture model	F-score ↑	0.497	0.446	0.670	0.313	0.626	0.585	0.395
	ARI ↑	0.406	0.353	0.573	0.221	0.577	0.534	0.339
	FMI ↑	0.500	0.453	0.706	0.331	0.665	0.631	0.461
	V-measure ↑	0.636	0.589	0.690	0.443	0.807	0.785	0.684
HDBSCAN—DGCyTOF	F-score ↑	0.669	0.650	0.665	0.560	0.981	0.923	0.684
	ARI ↑	0.573	0.547	0.569	0.417	0.977	0.907	0.601
	FMI ↑	0.696	0.680	0.691	0.616	0.981	0.923	0.696
	V-measure ↑	0.698	0.657	0.691	0.565	0.952	0.898	0.701

Table 6

Calibration of cell types utilizing calibration feedback for CyTOF1 and CyTOF2 data.

CyTOF1 data	Coefficient (r)		CyTOF2 data	Coefficient (r)
Cell type	Before	After	Cell type	Before	After
CD11b-_Monocyte_cells	0.6627	0.6657	Basophils	0.6094	0.613
CD11bhi_Monocyte_cells	0.7261	0.7261	CD16-_NK_cells	0.5474	0.5481
CD11bmid_Monocyte_cells	0.6666	0.6696	CD16+_NK_cells	0.6138	0.617
CMP_cells	0.4809	0.4864	CD34+CD38+CD123-HSPC	0.6346	0.6403
Erythroblast_cells	0.3733	0.3756	CD34+CD38+CD123+HSPC	0.6658	0.6992
GMP_cells	0.5715	0.5796	CD34+CD38lo_HSCs	0.5879	0.5942
HSC_cells	0.5544	0.5734	CD4_T_cells	0.6095	0.6096
Immature_B_cells	0.3899	0.3932	CD8_T_cells	0.6247	0.6249
Mature_CD38lo_B_cells	0.4863	0.4866	Mature_B_cells	0.6806	0.6806
Mature_CD38mid_B_cells	0.5594	0.5614	Monocytes	0.6925	0.6926
Mature_CD4+_T_cells	0.5155	0.517	pDCs	0.6511	0.6568
Mature_CD8+_T_cells	0.5916	0.5935	Plasma_B_cells	0.6055	0.6148
Megakaryocyte_cells	0.2805	0.2854	Pre_B_cells	0.6462	0.6475
MEP_cells	0.6374	0.6492	Pro_B_cells	0.6837	0.6914
MPP_cells	0.4966	0.5041
Myelocyte_cells	0.3919	0.3927
Naive_CD4+_T_cells	0.6915	0.6931
Naive_CD8+_T_cells	0.6891	0.6907
NK_cells	0.4645	0.4656
Plasma_cell_cells	0.4622	0.4638
Plasmacytoid_DC_cells	0.6214	0.6388
Platelet_cells	0.4867	0.5078
Pre-B_I_cells	0.559	0.5657
Pre-B_II_cells	0.5436	0.5456
Cell-type homology	0.537608	0.542942		0.632336	0.637857

34 in total

1. A global geometric framework for nonlinear dimensionality reduction.

Authors: J B Tenenbaum; V de Silva; J C Langford
Journal: Science Date: 2000-12-22 Impact factor: 47.728

2. Interpreting flow cytometry data: a guide for the perplexed.

Authors: Leonore A Herzenberg; James Tung; Wayne A Moore; Leonard A Herzenberg; David R Parks
Journal: Nat Immunol Date: 2006-07 Impact factor: 25.606

3. What is principal component analysis?

Authors: Markus Ringnér
Journal: Nat Biotechnol Date: 2008-03 Impact factor: 54.908

4. Global nonlinear kernel prediction for large data set with a particle swarm-optimized interval support vector regression.

Authors: Yongsheng Ding; Lijun Cheng; Witold Pedrycz; Kuangrong Hao
Journal: IEEE Trans Neural Netw Learn Syst Date: 2015-05-11 Impact factor: 10.451

Review 5. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

Review 6. Identifying cell populations with scRNASeq.

Authors: Tallulah S Andrews; Martin Hemberg
Journal: Mol Aspects Med Date: 2017-07-25

Review 7. Mass Cytometry: Single Cells, Many Features.

Authors: Matthew H Spitzer; Garry P Nolan
Journal: Cell Date: 2016-05-05 Impact factor: 41.582

8. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis.

Authors: Jacob H Levine; Erin F Simonds; Sean C Bendall; Kara L Davis; El-ad D Amir; Michelle D Tadmor; Oren Litvin; Harris G Fienberg; Astraea Jager; Eli R Zunder; Rachel Finck; Amanda L Gedman; Ina Radtke; James R Downing; Dana Pe'er; Garry P Nolan
Journal: Cell Date: 2015-06-18 Impact factor: 41.582

Review 9. Experimental design for single-cell RNA sequencing.

Authors: Jeanette Baran-Gale; Tamir Chandra; Kristina Kirschner
Journal: Brief Funct Genomics Date: 2018-07-01 Impact factor: 4.241

10. flowCore: a Bioconductor package for high throughput flow cytometry.

Authors: Florian Hahne; Nolwenn LeMeur; Ryan R Brinkman; Byron Ellis; Perry Haaland; Deepayan Sarkar; Josef Spidlen; Errol Strain; Robert Gentleman
Journal: BMC Bioinformatics Date: 2009-04-09 Impact factor: 3.169