Literature DB >> 36191000

HAL-X: Scalable hierarchical clustering for rapid and tunable single-cell analysis.

James Anibal¹, Alexandre G Day², Erol Bahadiroglu¹, Liam O'Neil³, Long Phan¹, Alec Peltekian¹, Amir Erez¹, Mariana Kaplan³, Grégoire Altan-Bonnet¹, Pankaj Mehta².

Abstract

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36191000 PMCID： PMC9560626 DOI： 10.1371/journal.pcbi.1010349

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Software paper.

Introduction

Technological advances have made it possible to collect huge single–cell datasets with numerous features e.g. transcriptomic profiles (scRNAseq) and/or protein expression (Flow cytometry, CyTOF, CyTEK or CiteSEQ). The ability to cluster such large, high-dimensional datasets is important for a variety of data-intensive fields ranging from biology to data mining. For this reason, there is a crucial need for fast, tunable and scalable clustering algorithms that work well in a high-dimensional setting on limited computational resources and that are at the same time reliable and robust. Clustering large high-dimensional datasets presents several challenges. Clustering in high dimensions suffers from the “curse of dimensionality” [1, 2]. The number of parameters in model-based clustering methods explodes and it becomes difficult to obtain accurate density estimates, a crucial ingredient in density-based clustering algorithms. Since many features are irrelevant or noisy, it is hard to construct meaningful similarity measures [3]. In order to circumvent these problems, one often needs to perform dimensional reductions and/or construct large similarity graphs, but both of these methods are computationally prohibitive for large datasets. Another difficult component of clustering is using the data to automatically learn the number of ideal clusters. Finally, arguably the most controversial and subjective part of clustering is validating clustering assignments when no ground-truth labels are available. In scientific research, clustering algorithms provide many advantages and are widely used. These algorithms are easy to use, readily explainable, and do not require any form of annotated data. Nonetheless, clustering does have a significant downside: the requirement of prior knowledge about the data. A common task in modern biology is to cluster individual cells into distinct cell populations based on measurements in a high-dimensional feature space (mRNA or protein levels). Here, we focus largely on mass cytometry data (CyTOF) which allow measurements of more than 30 features, often for tens of millions of cells in a single sample [4, 5]. When performing exploratory analysis of unsupervised data, researchers often do not know the specificity (i.e. depth of clustering) needed to derive biologically-meaningful populations. A rare yet important population of cells may be lost when there are insufficient clusters. Conversely, it may be beneficial to study larger, well-understood populations that yield immediately actionable insights with robust statistical features. To perform this type of comprehensive analysis, multiple sets of clusters must be generated across a broad spectrum of specificity. For currently available methods (i.e., Parc, Phenograph, FlowSOM), this requires training multiple models with different parameters to control the number of clusters [5-7]. In this scenario, time and/or memory complexity of these algorithms generate a significant problem, particularly in the area of cluster prediction. In the case of training, the solution is to perform dimensional reduction of the feature space (e.g. PCA the features) as well as possibly downsampling the data [5, 8]. In the context of single-cell analysis, there are a number of disadvantages to downsampling the dataset. While it is straightforward to identify generalizable cell populations with downsampled data, downsampling makes it difficult to identify nuanced perturbations within populations. In addition, statistically significant biological features can also be lost. For example, biologists often try to identify biomarkers by training a supervised model for separating healthy cells from diseased cells. However, if the training data set is too small (i.e. there is too much downsampling), the trained model will exhibit severe overfitting because the downsampled data is not representative of true biological distributions. One potential solution is to this problem is to train a clustering model on a downsampled dataset and use this model to predict labels on new data not seen by the model. Unfortunately, many of the most commonly used algorithms used for identifying cell clusters are not amenable to this approach because they are transductive methods: the underlying model changes when considering a new data point. As such, Parc and Phenograph lack the ability to predict labels for newly acquired data points. HDBSCAN [9], allows for label prediction after the model has been trained by estimating the position of the new point within a fixed hierarchical tree. To do this, the feature space of the test data must be reduced to match the training data. However, dimensionality reduction on a 25.106x40 matrix (a typical dataset acquired by mass cytometry [5]) will overwhelm the computing resources available to most biologists. Furthermore, prediction becomes more computationally expensive as the amount of training data is increased. This is because labeling is performed by nearest neighbour search rather than inputting the data into a learned function. In this paper, we introduce a hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to overcome these challenges. Our algorithm builds upon the idea that clustering can be viewed as a supervised learning problem where the goal is to predict the “true class labels” from data [10-12]. Unlike these earlier works that rely on the ideas of cluster stability, we operationalize this concept by training an expressive supervised learning model (i.e., support vector machines, random forests) to evaluate potential cluster assignments and by using out-of-sample performance as a metric of clustering goodness. By training a sufficiently expressive supervised model, we can ensure that the predictive power is limited by the reproducibility of our clustering assignments and not by the choice of classifier [13, 14]. Finally, we emphasize that using out–of–sample performance of expressive classifiers for clustering can be thought of as a generalization of density clustering to high-dimensional space, where direct density estimation is extremely hard. We empirically demonstrate that out-of-sample performance of a classifier trained on cluster labels is strongly correlated with the goodness of the clustering in the over-clustering regime. We leverage this relationship to generate robust clusters which generalizes well and enables the analysis of large biological datasets. The output of our algorithm is a predictive classifier that encodes our clustering assignments. Our work makes 3 primary contributions to unsupervised learning: Since we can easily predict labels with supervised classifiers, we can extensively down-sample our data when performing computationally expensive tasks such as dimensionality reduction and the construction of similarity matrices (tasks which scale as O(n2) or O(n log n)). Through the use of inductive learning methods, we can predict new data points at a minimum computational cost by simply applying the trained classifier on the full dataset. Moreover, we train these classifiers directly on unreduced data (i.e. without a dimensional reduction step), removing the need for expensive reduction of the test data. Our model stores the classifiers corresponding to different out-of-sample error values. Thus, we can obtain multiple sets of cluster labels with a single model. We have converted our model into a simple python package such that users can easily generate multiple clustering, simply by specifying the desired value for the minimum out-of-sample error (intuitively, more subtle clusters lead to high error values). HAL-x removes the need for pre-processing the entire dataset or training a model on the entire dataset. Moreover, HAL-x allows rapid, scalable prediction of new points. Finally, a single trained HAL-x model can generate multiple clusterings at varied depths to account for the specificity/sensitivity trade-off. Our results show that supervised linkage methods have great potential in analyzing high-dimensional data with limited computational resources that are typically available to experimental biologists (i.e., 8 core laptop computers).

Related work

Clustering has been extensively studied in the statistics and machine learning literature (see [3, 15] for full review). Prominent methods for clustering include density-based methods [16, 17], spectral clustering methods [18], and model-based clustering [2]. In the high-dimensional context, several algorithms have been developed that seek to rely on clustering on a subspace of features [19]. Other methods rely on dimensional reduction methods such as Principal Component Analysis (PCA) and Local Discrimination Analysis (LDA) in conjunction with K-mean clustering (often in an iterative manner) [20-22]. There has also been considerable performance on extending methods that generalize manifold learning and spectral clustering to a high-dimensional setting [23, 24]. Several algorithms have been proposed for clustering in high-dimensions by first projecting the data using non-linear embeddings using such as t-SNE and then performing clustering in the low-dimensional space [25]. A drawback of these approaches is that low-dimensional representations can compress the data in such a way that spurious clusters may form in the low-dimensional representation [26], making it extremely challenging to identify true clusters from artifacts of visualizations. Additionally, there exists specialized methods developed specifically for high-dimensional biological data such as Phenograph [5], PARC [7], FlowSOM [6] (see [27] for a comparison of some of these methods). However, these generally tend to require extensive computational resources to scale to large datasets (see analysis below). More recently, state-of-the-art results were obtained by the DEC algorithm which combines deep embeddings with k-means clustering on the compressed, encoded representation [28]. The encoder weights and the means of the k-clusters were fine–tuned using a clever stochastic gradient descent–based training procedure. A particularly attractive aspect of this algorithm is that the cluster labels for new points can be calculated quickly by using the trained encoder to map points to the embedding space and finding the closest cluster center. However, an important drawback of this approach, shared by all k-means based approaches, is that it requires choosing the number of clusters by hand. This is often difficult to determine ahead of time for complex, high-dimensional data and retraining the deep encoder for different k can be time-consuming. Here we show how one can combine powerful non-linear embedding techniques, density clustering, and supervised methods that enforce self-consistency of cluster labeling to build a fast, accurate high-dimensional clustering algorithm that scales well to large datasets.

Out-of-sample error captures clustering performance

Consider the problem of clustering n points that live in a d-dimensional feature space. A (hard) clustering C(x) maps each point in the feature space to a set of discrete labels in a set K, with the cardinality of K equal to the number of possible cluster labels |K| = k. For example for k-means clustering, given a set of k-centroids , the cluster labels are determined by labelling each data point by the closest cluster centroid. Empirical and theoretical arguments suggest that meaningful clusterings C(x) should not differ much when learned using different subsets of the data and different measure of stability have been proposed to quantify this idea [10–12, 29]. Here, we extend this idea to argue that good clustering C(x) are maps that should generalize well. That is, given a clustering assignment generated on a subset of the data (a training data set), one should be able to accurately train a classifier to predict the cluster labels on unseen data (a test set). An important caveat is that the classifier we train should be expressive enough so that the predictive power of a classifier is limited by the properties of the clustering map rather than the classifier. It was recently shown that large, neural networks can achieve zero training error on random labels [13], suggesting that it is always possible to find such a classifier even for complex data. In practice, we have found that any reasonably powerful supervised learning algorithm (kernel-based SVMs, Random Forests, small neural networks) are sufficiently powerful to assess the goodness of clustering. The intuition for using generalizability as a criteria is straightforward: clustering labels should reflect the underlying structure of the probability distribution from which the data is generated and we should be able to train an expressive classifier to learn this structure. In order to understand why out-of-sample accuracy of expressive classifiers is a good proxy-metric for clustering accuracy, we must first nuance two distinct clustering regimes: the overclustering and underclustering regime:

Overclustering (Fig 1a)

In the overclustering regime the number of predicted clusters is greater or equal than the number of true clusters. In such a situation, a classifier trained on clustering labels will generate decision boundaries across regions of relatively high density. Generating a decision boundary in a high density region increases the probability of overfitting and as such reduces the out-of-sample accuracy. This is precisely why out-of-sample accuracy is a good proxy for goodness of clustering.

Overclustering (nCluster = 4) a) versus underclustering (nCluster = 2) b).

The ground truth has 3 distinct clusters. The black lines represent the decision boundaries obtained via a random forest classifier trained on the clustering labels. a) Using over clustered labels for fitting a classifier leads to overfitted decision boundaries that separate regions of high density (red ang green region separation). b) For underclustered labels, the classifier groups together two clusters separated by a region of low density.

Underclustering (Fig 1b)

In the underclustering regime the number of predicted clusters is smaller than the number of true clusters. This is a regime where expressive classifiers will indicate good performance but will not necessarily correlate with the goodness of clustering with respect to the ground truth. In Fig 2, we empirically tested the relationship between out-of-sample accuracy and the goodness of clustering in the overclustering regime. We use 6 benchmark datasets from scikit-learn [30] and for each dataset generate multiple clustering assignments by tuning the hyper-parameters of various out-of-the box clustering methods from scikit-learn (see S1 Methods). For each generated clusterings, we train a random forest classifier and evaluate its accuracy on an hold-out test set. The out-of-sample accuracy shows a strong monotonic relationship with the weighted F-score. Moreover, we find a pearson correlation coefficient between the weighted F-score and the logarithm of the out-of-sample accuracy is 0.85 across the 225 sampled combination of datasets and clusterings.

Fig 2

Out-of-sample accuracy vs. weighted F-score (comparison to ground-truth) for various clustering algorithms.

Out-of-sample accuracy vs. weighted F-score (comparison to ground-truth) for various clustering algorithms.

We used out-of-the-box clustering algorithms DBSCAN, spectral clustering, K-means and Meanshift with various hyper-parameters and clustered 6 different benchmark two-dimensional datasets: Noisy Circles, Gaussian mixtures with different covariance structures, non-convex clusters (Noisy Moons) and a random noise map (No Structure). These datasets were taken from scikit-learn clustering methods as benchmark datasets (see S1 Methods for more details). The out-of-sample accuracy is computed by training a random forest classifier with 100 estimators on the clustering labels provided by the clustering algorithms. We use a 80/20 train/test split to train the classifiers and evaluate the out-of-sample error. Notice that the F-score and out-of-sample accuracy as measured are monotonically related. We started by considering simple low-dimensional datasets designed to highlight the caveats and strengths of standard clustering methods [see scikit-learn]. For each dataset, we performed clustering using different clustering methods (DBSCAN, Spectral clustering, k-means, Meanshift) with varying parameters (see S1 Methods for the parameters used). The results are presented in Fig 2, where we plotted the out-of-sample accuracy of the trained classifier against the comparison to the ground-truth labels. The latter is quantified by using the weighted F1-score which uses the Hungarian algorithm for the optimal cluster matching and equally weights small or large clusters. As can be seen in the figure, clusterings with high F1-scores also tend to have good out–of–sample accuracy confirming that out-of-sample accuracy is a reasonable proxy for good clustering.

Design and implementation

HAL-x has four major components that allow fast clustering of very large, high-dimensional datasets (Fig 3). First, HAL-x applies the t-SNE algorithm (or another dimensional reduction procedure such as UMAP) to reduce the dimensions on a down-sampled portion of the data. Second, HAL-x uses an approximate nearest neighbors algorithm and kernel density estimation to identify a initial set of “pure clusters” in regions above a specified density threshold. Third, HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima and should be merged. Fourth, HAL-x builds a sparsely connected K-nearest neighbor (KNN) graph that connects each cluster with the k most similar clusters. The edges of the graph are labeled with the accuracy of a supervised classifier (i.e. SVM, random forest) that is trained to separate unreduced data points taken from the two clusters in the original high-dimensional feature space. The clusters with the lowest edge-score (lowest classification accuracy) are then merged. An overview of the basic workflow is shown in Fig 3. Runtime analysis and a pseudocode representation of the HAL-x algorithm can be found in S1 Methods.

Fig 3

Overview of the hierarchical agglomerative learning approach.

(a) For high-dimensional inputs, a dimensionality (c.g. UMAP [31], t-SNE, etc.) reduction step is required in order to obtain reliable density estimates. (b) In low-dimensional spaces, density maps can be easily computed. Initial clusters are selected to be the neighborhood of the density modes (see [17]). (c) A k nearest-neighbor graph is constructed by measuring similarity via the out-of-sample accuracy by training classifiers in the original high-dimensional feature space: each node represents an individual cluster and each edge has an associated weight given by the accuracy of the classifier. (d) Nodes are successively merged by pairs following the procedure until a desired out-of-sample accuracy is reached. The end result consists of an interpretable hierarchical classifier and robust clustering assignments. The classifier can be used to predict the labels of new data and potentially identify outliers.

Overview of the hierarchical agglomerative learning approach.

Dimensionality reduction

HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space (typical datasets collected in mass cytometry measurements). To do this, the data must first be projected into a low-dimensional plane. HAL-x uses the z-score technique to normalize a down-sample of the data. The size of the down-sampling must be adjusted depending on the dimensionality and quality of the dataset. For real experimental CyTOF datasets like those considered below, we have empirically found that using a down-sampled dataset consisting of one to two hundred thousand cells works well. This down-sampling is used to create a low-dimensional embedding, typically UMAP [31] or Fast Fourier Transform-accelerated Interpolation-based t-stochastic neighbor embedding (fitSNE) [32]. t-SNE is especially well-suited for density clustering methods since it preserves local ordination in the data, while repelling points that are far away. A drawback of using a low-dimensional embedding such as tSNE or UMAP is that one must settle on a particular metric (usually Euclidean) to compute the initial similarity matrix before learning the embedding. This is problematic for data with highly correlated features and that have underlying labels invariant under global transformations (e.g. for images: translations and rotations). In order to alleviate this problem and generalize our approach to a broad class of high-dimensional datasets, we can perform principal component analysis which is known to help identifying uncorrelated linear combinations of features [15].

Identification of pure clusters

HAL-x identifies an initial set of “pure”, high-density clusters that form the starting point for agglomerative clustering. This is done by performing density clustering in the low-dimensional embedding space using a variation of the algorithm proposed in [17]. In particular, cluster centers are identified by finding isolated high-density points, with nearby neighbor mapped to the closest cluster center. To estimate local densities, HAL-x performs an approximate nearest neighbors search via the locality sensitive hashing (LSH) search algorithm with random projections. For each point, the number of neighbors is scaled linearly with the number of points in the down-sampled dataset. Once HAL-x has identified the neighborhood of each point, a Gaussian kernel density estimate is used to determine if a point has the highest density in the neighborhood. A full density map is constructed by assigning a Gaussian kernel of bandwidth h to each data point (x ∈ X) and summing all kernels: where . The bandwidth parameter can be thought of as a regularization parameter and sets how wide each kernel should be. We determine this parameter using a maximum likelihood estimation of Eq 1 over a testing set. We start by splitting the samples into a training and a testing set X = (Xtrain, Xtest). Using the training set in order to construct and estimate , we can then compute the negative log-likelihood of the kernel density estimate over the testing set: The bandwidth parameter h can therefore be determined by minimizing Eq 2, which is a simple one-dimensional f function of h. After all centers have been identified, labels are assigned to each data point by simply comparing the relative density gradients between a given point and the cluster centers. After labels have been assigned to each point, if, for a given neighborhood, 99 percent of the cells share the same cluster center, that cluster is considered “pure” and is added to the initial set of pure clusters. Once HAL-x identifies the pure clusters (i.e. performs the initial clustering), it performs a second, more rigorous validation of the point cluster assignment. We use two separate validation protocols to ensure that our initial neighborhoods have sufficiently high density and are not errant groupings that stem from a low density threshold. HAL-x uses the approximate nearest neighbors search algorithm a second time to define an extended neighborhood for each cluster center. This extended neighborhood includes points that are within a more relaxed density threshold than the one used for the approximate nearest neighbor search in the previous step. If there is a significant decrease in purity within the extended neighborhood, the conflicting clusters (representing the same density maxima) are merged to prevent unnecessary overlap.

Supervised linkage

After the initial clustering (i.e. after the identification of the “pure” clusters), HAL-x performs agglomerative clustering by linking clusters that are hard to distinguish. This supervised linkage between clusters is performed in the original high-dimensional feature space. This is done by training supervised learning classifiers (commonly random forests or support vector machines) to distinguish all pairs of clusters from each other. In other words, if there are N pure clusters, we train N(N − 1)/2 classifiers that distinguish between each pair of clusters. HAL-x then builds a minimally connected K-Nearest Neighbors (KNN) graph wherein each edge in the graph connected a cluster to the K clusters that are the most difficult to distinguish from itself (i.e. to the K clusters with the lowest out-of-sample accuracy) [33]. Using the KNN graph, HAL-x then performs a deeper sweep on each pair connected by an edge. This deeper sweep is performed with a larger ensemble of random forest classifiers built on bootstrapped samples of the data, thereby reducing the risk of overfitting. When HAL-x has completed this deeper sweep, the two clusters with the lowest edge-score (classification accuracy) are merged. The KNN graph and the labels are updated accordingly. This process (starting with a deep sweep of all clusters connected by an edge) is repeated until there is only a single cluster that contains all the data points (corresponding a perfect cv-score of 1). The classification ensemble that generated the final merge is stored in the root node of a tree data structure. We chose a tree data structure so that a path can be traced between any pure cluster and the root. Excluding the pure clusters, which act as the leaves, each cluster is a parent for the two child clusters that were merged to create this parent cluster. The classifiers used to separate the child clusters are stored in the parent node, and each edge (separating a parent and a child node) is linked to the accuracy value (cv-score) that led to the merge.

Prediction

Using the tree data structure created during the training process, HAL-x can ultimately predict the multiple cluster labels for millions of unreduced data points. For this purpose, the user selects a minimum score (cv-score), which corresponds to the accuracy of the cross-validated supervised models. Every predicted label will correspond to a cluster that can be separated from all other clusters with a mean accuracy score greater than the cv-score value. A higher cv–score will result in fewer clusters, whereas a lower cv–score will result in a large number of clusters that are only slightly different from neighboring clusters (See Fig 4B and 4C). The default cv-score is 0.50: random classification. A cv-score of 0.50 will typically consider the full set of pure clusters when predicting the labels.

Fig 4

Clustering based on out-of-sample error.

Clustering based on out-of-sample error.

(a) Clusters generated with a cv-score of 0.97. (b) Graph illustrating out-of-sample error for each pair of clusters. There are no pairs of clusters with an out-of-sample error that is greater than 0.03 (accuracy value of 0.97). (c) Clusters generated with a cv-score of 0.98. Clusters 1 and 4 have been merged because the out-of-sample error was greater than 0.02. (d) Graph illustrating accuracy for each pair of clusters. There are no pairs of clusters with an out-of-sample error that is greater than 0.02 (accuracy value of 0.98). Algorithm 1 HAL-x Prediction Input: data matrix x, tree tree, float cvScore parent = tree.root for point in x do while True do if parent.edgeScore > cvScore then label = parent.ensemble.predict(point) parent = tree.getNode(label) else point.label = label break end if end while end for To make the predictions and thereby apply the clustering across all datasets, HAL-x first separates the data using the classifiers corresponding to the final merge that created the cluster containing all the data. Subsequent predictions are made using the classifiers from the children of the predicted cluster, further separating the data based on increasingly nuanced patterns. This process continues until the next predicted label would result in a cluster that is connected to the parent with an edge-score of less than cv-score. With the tree approach utilized here, a range of cv-score values can be efficiently checked for the same model, generating different sets of clusters depending on the level of population specificity desired by the user. Algorithm 1 highlights the simplicity of the HAL-x prediction technique via the cv-score parameter.

Results

Here, we present the results of testing HAL-x on both synthetic and real-world datasets. First, we benchmarked HAL-x on the FlowCAP I dataset. We then generated multiple synthetic datasets containing Gaussian populations. We used these datasets to compare the scalability of HAL-x to Parc, Phenograph, FlowSOM, and HDBSCAN. We also tested HAL-x on a large, real-world mass cytometry dataset [4, 34], custom–generated at the NIH. This dataset contains 25.9 million single-cells taken from healthy individuals and from patients with Lupus [35]: these samples were profiled with an antibody panel quantifying 34 surface epitopes. After clustering, we performed supervised classification to determine if HAL-x clusters could be used to discriminate clinical statuses from leukocyte profiles: this provides a foundation for future experimental work and drug/biomarker discovery.

Levine Benchmark datasets

We first tested HAL-x on two benchmark datasets from [5] to demonstrate the validity of HAL-x clusters: scalability is meaningless if the outputs are random. The first dataset is a flow cytometry dataset containing 81,747 human bone marrow mononuclear cells (BMMC) cells from 1 healthy tissue sample. These BMMC cells have 13 features corresponding to surface epitopes and a label corresponding to phenotype. The second dataset is a mass cytometry dataset containing 104,184 human bone marrow mononuclear cells (BMMC) cells from 2 healthy tissue samples. These BMMC cells have 32 features corresponding to surface epitopes and a label corresponding to phenotype. These phenotypes have been hand defined by flow cytometry experts and represent our general understanding of leukocytes in bone marrow. We used HAL-x to cluster these single–cell datasets and compared the multiclass F1-scores with those achieved by three other clustering algorithms: Parc, Phenograph, and HDBSCAN (Table 1).

Table 1

F1 Measure of clustering accuracy for diverse methods, applied onto CyTOF datasets from [5].

Method	F1 for Levine-13	F1 Levine-32
HAL-x	0.61	0.77
Parc	0.50	0.73
Phenograph	0.52	0.65
HDBSCAN	0.21	0.15

We see from Table 1 that HAL-x significantly outperforms Parc, Phenograph, and HDBSCAN on both these datasets, showing HAL-X is competitive with existing techniques in terms of accuracy.

Synthetic datasets

HAL-x is designed as a convenient, accessible tool for biomedical researchers for this reason we wanted to assess how well HAL-x preformed on large datasets with limited computational resources. To do so, we generated a synthetic dataset with 10 million data points, the typical size of a dataset in single-cell biology, and tested HAL-x on an 8-core laptop computer—the typical computing resources available to a bench researcher. High-performance computing was not used in this study. Within this computing environment, we compare HAL-x to the three algorithms widely used for fast clustering of high-dimensional data: Parc, Phenograph, and HDBSCAN. We generated datasets with 103, 104, 105, 106, and 107 synthetic data points, each with 30 features. The size and dimensions of these synthetic datasets were chosen to match the structure of mass cytometry data, which typically contains approximately 30 features (antibodies) and more than 107 single cells. For HAL-x and HDBSCAN (algorithms which have prediction capabilities), we defined a 1 percent downsample to train on the datasets containing from 104 to 106 data points. For the dataset with 103 points, we used a 10 percent downsample because 100 data points was insufficient for the perplexity of the fitSNE algorithm. We used a 0.5 percent downsample on the dataset with 107 points so the memory required by the fitSNE algorithm would not exceed available resources in our computing environment. We do not perform any dimensionality reduction on the test dataset: without significant computing resources, this is typically infeasible for large datasets. Therefore, we are unable to reduce the training set for HDBSCAN, which requires both training and test data to have the same dimension. Parc, Phenograph, and FlowSOM cannot predict the cluster labels of new data points. For these methods, we attempted to learn the clustering model on the entire dataset (without downsampling). We used the default parameters for all the methods considered in this study. In Fig 5, we show the scalability of HAL-x compared to 4 algorithms that are commonly used for single-cell data clustering. We see that all 5 algorithms are relatively equal at 103 and 105. At 105, Phenograph is significantly slower than HAL-x, HDBSCAN,FlowSOM, and Parc. At 106, Parc,FlowSOM, and Phenograph exceed the available memory within our computing environment. HDBSCAN can train a model on 500,000 cells and predict on 107 cells, but the runtime is immensely slower than HAL-x. This observation is consistent with the predictive mechanism of HDBSCAN, which relies on nearest neighbour search rather than a set function learned by the supervised classifiers used in HAL-x.

Fig 5

Comparison of run times for training and label prediction on synthetic datasets of increasing sizes (comparing HAL-X with other clustering algorithms routinely used with cytometry data -Parc, FlowSOM, HDBScan & Phenograph).

The graybox indicates that the memory of laptop was exceeded for large datasets when using other algorithms than HAL-X.

Comparison of run times for training and label prediction on synthetic datasets of increasing sizes (comparing HAL-X with other clustering algorithms routinely used with cytometry data -Parc, FlowSOM, HDBScan & Phenograph).

The graybox indicates that the memory of laptop was exceeded for large datasets when using other algorithms than HAL-X.

Lupus dataset

Finally, to illustrate the scalability and tunability of HAL-x, we tested HAL-x on a real-world mass cytometry datasets containing 25.9 million individual cells analyzed from 75 peripheral blood mononuclear cells (PBMC) samples and 75 polymorphonuclear (PMN) samples [36]. 49 of these samples were from Lupus patients and 26 of these samples were from healthy individuals. We designed a 38-marker mass cytometry panel with a focus on neutrophils as well as general phenotyping of blood leukocytes (see S1 Methods, S1 and S2 Figs for experimental details). At the onset of this study, we assumed no prior knowledge regarding which populations whose differential frequencies would constitute good biomarkers for lupus and potential targets for immunotherapy. Thus, we sought to generate 5 separate sets of clusters, each with a different level of population specificity, ranging from very rare to quite broad. In under 1 hour, HAL-x generated 5 separate clusterings for the 25.9 million cells. Fig 6B and 6C presents the linear relationship between the cv-score parameter and the number of clusters, demonstrating the ability of HAL-x to widely vary the number of clusters using the same model.

Fig 6

Analysis of Lupus patients’ blood profile using single-cell mass cytometry and HAL-x.

Analysis of Lupus patients’ blood profile using single-cell mass cytometry and HAL-x.

(A) The pipeline for profiling immune cells from Lupus patients and healthy donors, beginning with blood collection, continuing with mass cytometry, and culminating with HAL-x processing. (B) t-SNE visualizations of clusterings generated by HAL-x which correspond to various levels of population specificity (cv-score). (C) Heatmap of the average expression levels for the 19 leukocyte clusters defined by HAL-x for a cv-score of 0.7. The Cell type annotations were made hand. (D) A graph of AUC (i.e. Area under the Receiver Operating Curve (ROC)) for different cv-scores shows that the most important clusters with discriminatory power are not observed for high CV-scores. The best cluster identified by HAL-x (cluster #42 for cv-score of 0.7), separates individual neutrophils from Lupus and Healthy patients with an AUC of 0.82. (E) Markers on neutrophils (cluster #42) that can be used to distinguish blood samples from Lupus patients and from healthy donors. Note how subtle the difference is at the level of individual markers. For each HAL-x cluster (across our 5 different clustering depths), we trained a random forest classifier to separate healthy cells and Lupus cells. We discovered multiple clusters of PBMC cells and neutrophils wherein the healthy cells and lupus cells were classified with an AUC score of 0.82. The altered phenotypes of these cell populations (Fig 6E) can be considered novel biomarkers for Lupus. In particular, Cluster #42 for cv-score of 0.7 is a subpopulation of mature neutrophils (defined as CD66ace+CD66b+) whose levels of expression for CXCR1, CD15, CXCR2, IgA, CD66ace, CD66b, CD10, CD45RO, CD24 and CD16 best distinguishes between Lupus patients and Healthy donors (Fig 6D and 6E). These populations will be the subject of wet-lab experimentation seeking to highlight their functional relevance for Lupus pathology and to identify new molecular/cellular targets for lupus immunotherapies. Moreover, our results on the lupus dataset show the importance of varying the population specificity when searching for key populations in high-dimensional data. We observed here that several of the key cell populations were only identified when the cv-score was 0.7 or lower (deeper clustering). When the cv-score was higher (a more broad clustering), these important clusters are merged with other populations. This resulted in lower accuracy values and the loss of nuanced phenotypes which may have clinical significance. Our results show the importance of testing multiple clustering with varying depths. For example, if we had fixed the cv-score at 0.8, we would have missed several biomarkers and lost valuable information about the cellular landscape. HAL-x offered both the robustness, tunability and scalability to match the biological complexity of our Lupus dataset.

Availability and future directions

We make HAL-x publicly available at: https://pypi.org/project/hal-x/. The emergence of inexpensive platforms for large-scale single-cell analysis in biomedical applications is presenting new challenges to researchers attempting to interpret these immense, high-dimensional datasets. HAL-x integrates density clustering (reduced data) and supervised classification (unreduced data) to generate clusters that can be projected across tens of millions of single-cells. The success of HAL-x on our lupus datasets shows the importance of clustering algorithms that: 1) can easily predict points for new data and 2) can interact with raw data. These are vital components in the time/memory efficiency achieved by HAL-x. Moreover, the simple HAL-x prediction algorithm (Algorithm 1) allows for one model to generate multiple sets of labels corresponding to varying levels of population specificity—a valuable advantage for biomedical researchers who need to cluster as deep as possible to identify key populations (biomarkers) of biological relevance and as shallow as possible to retain statistical significance in these biomarkers. The success of HAL-x clustering when used for supervised classification tasks (i.e.identifying a novel cluster of neutrophils that distinguishes lupus patients from their healthy counterparts) suggests that our approach has a promising role in biomarker derivation and drug discovery. For example, the trends within specific subpopulations can be used to predict targets for new Lupus immunotherapies. More generally, the ability of HAL-x to generate a clustering model that can be propagated and tuned across multiple datasets will standardize the ability of biological researchers to analyze vast amount of data and to communicate their results.

Supplementary methods.

(PDF) Click here for additional data file.

Panel of metal-coupled antibodies used to profile the blood samples from Lupus patients and healthy donors.

(PDF) Click here for additional data file.

Raw .fcs file processing and gating strategy.

Left: gating for live cells (excluding Cisplatin-positive dead cells and EQ4 calibration beads. Right: gating for CD45+DNA(2n) singlet leukocytes. (PDF) Click here for additional data file.

Dynamical dashboard generated by HAL-x in order to inspect the different clusters and their hierarchy.

The dashboard has 3 panels. (left-side) The hierarchical relationship between the clusters. (top right side) The embedding map with the clustering labels. (bottom right) When choosing a cluster, this shows the profile of that clusters in terms of the original features. (PDF) Click here for additional data file.

Benchmark datasets used in figures of main text.

The red/blue/green coloring represents the ground-truth clustering labels when generating the datasets. The datasets are taken directly from scikit-learn clustering benchmark page. See https://scikit-learn.org/stable/modules/clustering.html for the exact definitions of the datasets. (PDF) Click here for additional data file.

HAL-x run for N = 106 synthetic data using a) t-SNE and b) UMAP for the primary embedding.

The results are very similar although we found that the UMAP approach ran approximately 50% faster in this instance. (PDF) Click here for additional data file. 16 Dec 2021 Dear Dr. Altan-Bonnet, Thank you very much for submitting your manuscript "Scalable hierarchical clustering with supervised linkage methods, for rapid and tunable single-cell analysis" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Please ensure that the reviewers' comments are thoroughly addressed. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, James R. Faeder Associate Editor PLOS Computational Biology Manja Marz Software Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The manuscript by Anibal, Day, et al. introduces a new data clustering method (HAL-x) that is well-suited for studying large, high-dimensional data sets like those generated by mass cytometry. Data clustering is an important and timely topic given the emergence of large data sets in biology that can contain tens of millions of data points in a feature space with 40 or more dimensions. There have been a number of data-clustering efforts in this area, which the authors review in their paper. The authors overcome shortcomings of existing works by creating a workflow involving nonlinear dimensionality reduction, density clustering, and supervised linkage methods to create a cluster hierarchy on raw data. This work advances the field through the computational efficiency (scalability) of the method and the ability to readily generate multiple clusterings at varied depths. Major concerns 1. In the “Related Work” section, the authors describe a shortcoming when using nonlinear embeddings like t-SNE and then performing clustering in the low-dimensional space: Namely, that the low-dimensional representations can compress the data and lead to spurious clusters. The first step of the HAL-x method uses t-SNE on a down-sampled portion of the data. It would be useful for the authors to explore how the initial embedding in low-dimensional space impacts the subsequent clustering by their algorithm. 2. The authors discuss (page 7) the possibility of preprocessing the input using an encoder “if necessary.” How can one decide if this step is necessary? 3. Regarding Figure 1, I was not convinced that “clusterings with high F1-scores also tend to have good out-of-sample accuracy confirming that out-of-sample accuracy is a reasonable proxy for good clustering.” Many clusterings with intermediate and even low weighted F-scores also have good out-of-sample accuracy. I encourage the authors to explain/explore this in more detail. I am not sure that I follow the logic/conclusion, or perhaps I am misunderstanding aspects of the figure. 4. The text in this section (“Out-of-sample error…”) refers to supplemental materials that I cannot find. 5. For the synthetic data set: How were the points distributed? How well did the algorithms perform on the data (i.e., how well did they cluster, assuming structured data was provided)? It seems like there is additionally opportunity here to assess the performance of the method. 6. I recognize that this is a methods-oriented paper. However, I think the paper would be strengthened by expanding the discussion of results for the Lupus dataset. For example, “if we had fixed the cv-score at 0.8, we would have missed several biomarkers and lost valuable information about the cellular landscape.” The authors don’t have to dramatically expand the scope of their paper, but this really leaves me hanging as a reader. I would like to know more about the biomarkers and/or what types of clusters are merged as cv-score increases. Minor comments - Some terms are not defined in Eqns 1 and 2. - Page 9: What does “consider the full set of pure clusters” mean? - Is FlowCAP I the same as the Levine benchmark datasets? - Based on the wording, it was hard to tell if the lupus data was generated for this paper or was a previously published data set. Typos “tend to have out good of sample” (page 5) “that are only slightly differences” (page 8) “with 10 millions data” (page 11) “constitute a good biomarkers” (page 12) Reviewer #2: In this paper, Anibal et al. describe a novel method for data clustering, especially high-dimensional data such as that produced by CyTOF and similar experimental methods. Their method works by reducing the dimensionality of the data, developing a base set of clusters on this space, and then refining them. One of the key principles in this work is that a good clustering algorithm should produce clusters that generalize well to new data. The authors test their method against other existing approaches and find that their approach has advantages in both speed and accuracy. This paper makes a valuable contribution to the literature on practical clustering methods. However, there are a number of points where the paper could be clarified or where important concepts are underexplored. In general, some details of the methods and simulations are unclear or have not been given, and the examples on real data could be explored further. My detailed comments are below: 1. I ran into several problems attempting to run the authors’ software, at least if this is run outside of Anaconda. In order to use the software, users must have the FFTW package installed because it is a requirement for fitsne. However, this is not currently mentioned in the installation notes, and a naïve installation through pip will fail without FFTW. The authors should note the dependency on FFTW and point to where it can be installled. The posted example also fails with the current version of sklearn due to changing conventions in the make_blobs function. I recommend that the authors change that line of the example to: “X,y = make_blobs(n_samples=10000,n_features=12,centers=10)” in order for the function call to complete correctly. The output in the example is also not easy to parse for a nonexpert. There is no output visualization, nor is there any apparent test quantifying goodness of fit (or some other metric) for how well the algorithm is working. I’m having trouble debugging this fully, but it looks like the visualization fails because at some point HAL calls for a plot using javascript, but this is run through a Python 2.7 call instead of Python 3 and the program fails to find the http.server module. I encountered these errors on a Mac with up to date versions of Python and all associated libraries. 2. One of the main points of emphasis in the paper is the ability to scale the sensitivity of the analysis to obtain more coarse or granular clusters (point 3 on page 3). The authors present real and simulated examples demonstrating this principle, but it would be helpful if some more detail was included, especially for the real data example. The simulated example presented in Figure 3 (note: no information is provided about the source of the data) gives a sense for how this works, but it appears that the definition of clusters can be quite sensitive to the chosen accuracy value. A small shift in the cv-score cutoff from 0.97 to 0.98 causes two large clusters to merge together. Of course, part of this may be due simply to the nature of thresholds, but this is not easy to see from the example. It would be helpful for future users of this method if an example were provided where a wide range of values for the accuracy cutoff is explored in a smooth way. The real data example provides something like this (Figure 5C), but it is difficult to appreciate how clustering changes as the threshold shifts given the large steps in cv-score values. It would also be helpful if the authors could present more detail on their results in the main text. At present, in this example, little data is presented beyond broad, qualitative statements (e.g., “several of the key cell populations were only identified when the cv-score was 0.7 or lower”; which populations are these, and how is it inferred that these populations are important?). Practically, it would also be helpful if such a worked example was provided together with the code, so that users of the algorithm will be well-prepared to perform this analysis themselves. 3. In the section on tests on synthetic data, the authors mention the size of the data set that they consider, but nothing is stated about how the data is structured. This is in principle an important choice, especially given that this section gives a comparison between different methods on the data. The authors should describe this in detail, and ideally the test data should also be available online (either as part of the repository for the code, in a database such as Zenodo, or as part of the supplementary material for this paper). 4. On page 7, the authors state that “we can also preprocess the input using an encoder if necessary” in place of t-SNE. At present, this statement is very vague, and it is not clear when the authors use t-SNE and when they use an encoder. How would the authors suggest that users of their software decide which approach to take for a particular data set? Can the consequences of different choices be explored on an example data set? How does this choice affect the run time? 5. Similarly, the authors use multiple classifiers for different data sets. In the examples in Figure 1, the authors use support vector machines, while for the Lupus data set they use random forests. Why? On page 4, the authors state that “any reasonably powerful supervised learning algorithm” can be used, but again this statement is rather vague and leaves users to wonder whether different algorithms may be appropriate for different circumstances, or whether these approaches are truly interchangeable. 6. The authors claim that the accuracy of predicting clustering for out of sample data is a good proxy for what constitutes good clustering, and I agree with their intuitive rationale. However, this is a bit difficult to see in Figure 1 because there are quite a few clusterings with high out of sample accuracy and low F-scores. It appears that this might be due to plotting multiple different types of data sets on the same plot, but this is still somewhat difficult to see. How well-correlated are the out of sample accuracy and F-score for each type of data? Also, in the materials that I have available, the data sets themselves are not described. It would be very helpful if this information was included in the supplementary information. In particular, I am somewhat confused by the “no_structure” data set – how does one evaluate clustering accuracy on a data set that has no intrinsic structure? 7. Despite its frequent use as a metric of clustering quality in the paper, the (weighted) F1-score is never explicitly defined. Minor comments: 1. On page 4, the authors write: “…there exist[s] specialized methods developed specifically for high-dimensional biological data such as Phenograph, others, etc.” Here, Phenograph is the only method that is named. Are there other methods that should be listed here? 2. In Figures 4 and 5 many of the fonts are small and difficult to read. The clustering plots in Figure 5B are so small that it is very difficult to examine them in detail. 3. Given that the HAL-x software can produce visualizations, it would be helpful for users if some of these visualizations were included in plots in the paper. This requirement may already be satisfied – as explained above, I was not able to get plots to work with the software, and so I was not able to examine the output directly. 4. It seems that equation 2 should be described as the negative log-likelihood (or minus the log-likelihood) instead of simply the log-likelihood. Reviewer #3: In this paper, the authors proposed HAL-x, a new hierarchical density clustering algorithm that can be scalable to a large data. HAL-x is especially designed for flow cytometry data, a large single cell data that often suffers from the scalability problem. While HAL-x has multiple interesting components, there are still multiple components that are not sufficiently clarified. I provide my comments in detail below. 1. The degree of down-sampling seems to be data-dependent. However, this is not a trivial task. For example, in the synthetic data analysis, the authors use very different values for different datasets, e.g., 10% vs. 0.5%. Can the authors provide some guideline to determine the degree of down-sampling? 2. fitSNE is proposed to be used for the dimension reduction purpose. Recently UMAP became more popular compared to t-SNE-based approaches. Can the authors provide justification to use fitSNE rather than UMAP? 3. It is proposed to map the data onto a two-dimensional plane using fitSNE. While 2D is great for the visualization purpose, it might not be optimal for the clustering purpose. Can the authors provide justification of using 2D or consider to use other dimensions? 4. In Equation 1, what is the mathematical meaning of arrow (->)? Convergence? 5. In the kernel density estimation, kernel and bandwidth are two critical choices and they have significant impact on the output. In the case of kernel, while Gaussian kernel is popular, Epanechnikov kernel is more popular and mathematically considered to be optimal. Similarly, there are various bandwidth selectors, including plug-in methods, cross-validation methods, etc. For example, Sheather & Jones method is a popular choice. Can the authors provide justification of using Gaussian kernel and a cross-validated MLE? 6. In Prediction step, the cv-score cutoff is a critical tuning parameter, which determines the number of clusters. While I acknowledge that hierarchical clustering is the power of HAL-x, biologists often need to determine a specific number of clusters for downstream analyses. Can the authors provide any guideline to determine the number of clusters? Is the default value (0.5) recommended for the general purpose? 7. For flow cytometry data, FlowSOM is the most popular algorithm for cell clustering and multivariate t-mixture is also likely popular. However, these popular algorithms are not included in the benchmarking. Please consider to include them in the benchmarking. 8. I cannot find the Author Summary section. Please provide this section in the manuscript. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No: The source code is available. However, details about synthetic data sets used in Figures 1 and 4 are missing. Reviewer #2: No: Test data and code used to evaluate different methods does not appear to be available at present. Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols 8 Jun 2022 Submitted filename: 20220307 Reviews @ PlosCompBio.docx.pdf Click here for additional data file. 2 Jul 2022 Dear Dr. Altan-Bonnet, We are pleased to inform you that your manuscript 'HAL-X: scalable hierarchical clustering for rapid and tunable single-cell analysis' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, James R. Faeder Associate Editor PLOS Computational Biology James Faeder Associate Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have addressed my concerns. Reviewer #2: The authors have responded to my original comments, and I have no further questions. Reviewer #3: I believe my previous comments were well addressed by the authors and this manuscript is acceptable to be published. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No 20 Sep 2022 PCOMPBIOL-D-21-01438R1 HAL-X: scalable hierarchical clustering for rapid and tunable single-cell analysis Dear Dr Altan-Bonnet, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofi Zombor PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

17 in total

1. Stability-based validation of clustering solutions.

Authors: Tilman Lange; Volker Roth; Mikio L Braun; Joachim M Buhmann
Journal: Neural Comput Date: 2004-06 Impact factor: 2.026

2. Spectral embedded clustering: a framework for in-sample and out-of-sample spectral clustering.

Authors: Feiping Nie; Zinan Zeng; Ivor W Tsang; Dong Xu; Changshui Zhang
Journal: IEEE Trans Neural Netw Date: 2011-09-29

3. Replicating Cluster Analysis: Method, Consistency, and Validity.

Authors: J N Breckenridge
Journal: Multivariate Behav Res Date: 1989-04-01 Impact factor: 5.923

4. Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding (ACCENSE).

Authors: Karthik Shekhar; Petter Brodin; Mark M Davis; Arup K Chakraborty
Journal: Proc Natl Acad Sci U S A Date: 2013-12-16 Impact factor: 11.205

5. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data.

Authors: Lukas M Weber; Mark D Robinson
Journal: Cytometry A Date: 2016-12-19 Impact factor: 4.355

6. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum.

Authors: Sean C Bendall; Erin F Simonds; Peng Qiu; El-ad D Amir; Peter O Krutzik; Rachel Finck; Robert V Bruggner; Rachel Melamed; Angelica Trejo; Olga I Ornatsky; Robert S Balderas; Sylvia K Plevritis; Karen Sachs; Dana Pe'er; Scott D Tanner; Garry P Nolan
Journal: Science Date: 2011-05-06 Impact factor: 47.728

7. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis.

Authors: Jacob H Levine; Erin F Simonds; Sean C Bendall; Kara L Davis; El-ad D Amir; Michelle D Tadmor; Oren Litvin; Harris G Fienberg; Astraea Jager; Eli R Zunder; Rachel Finck; Amanda L Gedman; Ina Radtke; James R Downing; Dana Pe'er; Garry P Nolan
Journal: Cell Date: 2015-06-18 Impact factor: 41.582

8. Integrating single-cell transcriptomic data across different conditions, technologies, and species.

Authors: Andrew Butler; Paul Hoffman; Peter Smibert; Efthymia Papalexi; Rahul Satija
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

9. Defective glycosylation and multisystem abnormalities characterize the primary immunodeficiency XMEN disease.

Authors: Juan C Ravell; Mami Matsuda-Lennikov; Samuel D Chauvin; Juan Zou; Matthew Biancalana; Sally J Deeb; Susan Price; Helen C Su; Giulia Notarangelo; Ping Jiang; Aaron Morawski; Chrysi Kanellopoulou; Kyle Binder; Ratnadeep Mukherjee; James T Anibal; Brian Sellers; Lixin Zheng; Tingyan He; Alex B George; Stefania Pittaluga; Astin Powers; David E Kleiner; Devika Kapuria; Marc Ghany; Sally Hunsberger; Jeffrey I Cohen; Gulbu Uzel; Jenna Bergerson; Lynne Wolfe; Camilo Toro; William Gahl; Les R Folio; Helen Matthews; Pam Angelus; Ivan K Chinn; Jordan S Orange; Claudia M Trujillo-Vargas; Jose Luis Franco; Julio Orrego-Arango; Sebastian Gutiérrez-Hincapié; Niraj Chandrakant Patel; Kimiyo Raymond; Turkan Patiroglu; Ekrem Unal; Musa Karakukcu; Alexandre Gr Day; Pankaj Mehta; Evan Masutani; Suk S De Ravin; Harry L Malech; Grégoire Altan-Bonnet; V Koneti Rao; Matthias Mann; Michael J Lenardo
Journal: J Clin Invest Date: 2020-01-02 Impact factor: 14.808

10. PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells.

Authors: Shobana V Stassen; Dickson M D Siu; Kelvin C M Lee; Joshua W K Ho; Hayden K H So; Kevin K Tsia
Journal: Bioinformatics Date: 2020-05-01 Impact factor: 6.937