Literature DB >> 24677621

SWIFT-scalable clustering for automated identification of rare cell populations in large, high-dimensional flow cytometry datasets, part 1: algorithm design.

Iftekhar Naim, Suprakash Datta, Jonathan Rebhahn, James S Cavenaugh, Tim R Mosmann, Gaurav Sharma.

Abstract

We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems.
© 2014 The Authors. Published by Wiley Periodicals Inc. on behalf of the International Society for Advancement of Cytometry.

Entities:  

Keywords:  Gaussian mixture models; automated multivariate clustering; ground truth data; rare subpopulation detection; weighted sampling

Mesh:

Year:  2014        PMID: 24677621      PMCID: PMC4238829          DOI: 10.1002/cyto.a.22446

Source DB:  PubMed          Journal:  Cytometry A        ISSN: 1552-4922            Impact factor:   4.355


Introduction

Flow cytometry (FC) has become an essential technique for interrogating individual cell attributes, with a wide range of clinical and biological applications 1–4. The goals of FC analysis are to identify groups of cells that express similar physical and functional properties and to make biological inferences by comparing cell populations across multiple datasets. The massive size and dimensionality of modern FC data pose significant challenges for data analysis (∼10^6 cells, >35 dimensions in some instruments). FC data have traditionally been analyzed manually by visualizing the data in bivariate projections. This manual analysis is subjective, time consuming, can be inaccurate for overlapping populations, and scales poorly with increasing numbers of dimensions. Moreover, many discriminating features present in the high-dimensional data may not be distinguishable in 2D projections. As a result, automated multivariate clustering has become highly desirable for objective and reproducible assessment of high-dimensional FC data. Recently, several methods have been proposed, which can be broadly classified into two categories: (a) nonprobabilistic hard clustering 5–8 and (b) probabilistic soft clustering 9–14. Hard clustering, which assigns each cell to one of the possible clusters, is likely more familiar to users of manual gating and is also essential for cell sorting. Soft probabilistic clustering, on the other hand, determines for each cell a probability distribution over the full set of clusters, thereby allowing for overlapping clusters. Analysis of FC data seeks to identify biologically meaningful cell subpopulations from per-cell measurements of antigen expression correlates measured via a set of fluorophore tags.
Typical datasets exhibit a high dynamic range for the number of events in each subpopulation, i.e., within a dataset, there are subpopulations with a large percentage (10% or higher) of the total events and subpopulations with a small percentage of the total events (0.1% or lower). The small subpopulations are often biologically significant and therefore important to resolve. Distinguishing these small subpopulations is challenging because, in the measurement space, they often consist of observations that form skewed, non-Gaussian distributions that appear merged as "shoulders" of larger subpopulations with which they overlap. To meet these challenges, we propose a soft mixture-model based framework, "SWIFT" (Scalable Weighted Iterative Flow-clustering Technique), which scales to large FC datasets while preserving the capability of identifying small clusters representing rare subpopulations. SWIFT differs algorithmically from prior methods in four main aspects: (a) the mixture modeling is performed in a scalable framework enabled by weighted sampling and incremental fitting, allowing SWIFT to handle significantly larger datasets than alternative mixture model implementations; (b) the weighted sampling is explicitly designed to allow resolution of small, potentially overlapping subpopulations in the presence of a high dynamic range of cluster sizes; (c) the algorithm includes a splitting and merging procedure that yields a final mixture model where each component is unimodal but not necessarily Gaussian; and (d) the determination of the number of clusters K is performed as an integral part of the algorithm via the intuitively appealing heuristic of unimodality. Parts of the SWIFT framework have been previously presented in preliminary form in 15. Recently, the detection of rare cell subpopulations has also been independently addressed in Ref. 14, using a hierarchical Dirichlet process model to solve the dual problems of finding rare events potentially masked by nearby large populations and of aligning cell subsets over multiple data samples. Compared with Ref. 14, SWIFT achieves better resolution of rare populations (data presented in companion manuscript 16). Also, the weighted iterative sampling and incremental fitting strategy in SWIFT scales better to large datasets, allowing the algorithm to operate on conventional workstations instead of requiring specialized GPU hardware. SWIFT is available for download at http://www.ece.rochester.edu/projects/siplab/Software/SWIFT.html.

Problem Formulation

To describe our methodology in precise terms, we consider the following mathematical formulation for our problem: N independent events, each belonging to one of several classes that are unknown a priori, generate a corresponding set of N d-dimensional observations. We adopt column vectors as our default notational convention, so that each observation x_i is a d × 1 vector. Given the d × N input dataset X = [x_1, x_2, …, x_N], we wish to estimate the number of distinct classes and the class for each of the N events. We refer to the estimated classes as clusters and denote by K the total number of clusters. In the FC context, the events correspond to distinct triggerings of FC measurements, usually caused by individual cells, and the classes correspond to biologically meaningful cell subpopulations. For FC measurements, it is common for a given region of the d-dimensional observation space to contain a significant number of observations from different subpopulations. With some abuse of terminology, in such cases, we say that the corresponding subpopulations, or classes, overlap. Because of the overlaps between classes, it is appropriate to assign soft memberships, i.e., to allow an event to belong to each of the K clusters with associated probabilities (or, from an alternative perspective, to allow fractional memberships in each of the K clusters). Thus, our goal is to determine a membership probability matrix Ω = [ω_ij], where ω_ij represents the probability that event i belongs to cluster j, for 1 ≤ i ≤ N and 1 ≤ j ≤ K, with Σ_{j=1}^{K} ω_ij = 1 for all 1 ≤ i ≤ N. A natural way to model the data in this setting is as a K-component mixture model. Specifically, we assume the given dataset represents N independent observations of a d-dimensional random variable X that follows a K-component finite mixture model, whose probability density is given by

f(x | Θ) = Σ_{j=1}^{K} π_j f_j(x | θ_j),   (1)

where f_j(x | θ_j) is the probability density function of the j-th mixture component having parameters θ_j and mixing coefficient π_j (π_j ≥ 0 and Σ_{j=1}^{K} π_j = 1).
Our goal is to estimate the parameter vector Θ = (π_1, …, π_K, θ_1, …, θ_K) that maximizes the likelihood of the given data, together with the density functions f_j in some parametric form. Once the mixture model parameter vector Θ is estimated, soft clustering can be performed by estimating the posterior membership probabilities using Bayes' rule, viz.,

ω_ij = π_j f_j(x_i | θ_j) / Σ_{l=1}^{K} π_l f_l(x_i | θ_l).   (2)

The finite mixture model therefore provides a framework for performing soft clustering in a principled manner, as has been done for a variety of problems 17,18.
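As an illustration of the posterior computation in Eq. 2, the soft membership matrix for a fitted GMM can be sketched in NumPy. This is our own illustrative code, not the SWIFT implementation; log-space normalization is added for numerical stability.

```python
import numpy as np

def posterior_memberships(X, weights, means, covs):
    """Soft cluster memberships w_ij (Eq. 2) for a fitted Gaussian mixture.

    X: (N, d) data; weights: (K,) mixing coefficients;
    means: (K, d); covs: (K, d, d).
    Returns an (N, K) matrix whose rows sum to one.
    """
    N, d = X.shape
    K = len(weights)
    log_p = np.empty((N, K))
    for j in range(K):
        diff = X - means[j]                       # (N, d)
        cov_inv = np.linalg.inv(covs[j])
        maha = np.einsum('ni,ij,nj->n', diff, cov_inv, diff)
        _, logdet = np.linalg.slogdet(covs[j])
        log_p[:, j] = (np.log(weights[j])
                       - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    # Normalize in log space to avoid underflow for distant points.
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```

A point near a component's mean receives a membership probability close to one for that component, while points in overlap regions receive fractional memberships.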

SWIFT Algorithm

Pragmatic considerations of complexity for the massive datasets encountered in FC motivated our choice of functional form for the mixture components f_j. Parameter estimation can be performed much more efficiently for Gaussian mixture models (GMMs) than for alternative models, such as mixtures of skewed Gaussians or skewed t-distributions, that allow greater flexibility for modeling naturally occurring (e.g., FC) distributions for a given number of components K. However, the value of K is not known a priori and cannot be determined apart from external heuristic considerations. Because a wide class of distributions can be closely approximated using sums of Gaussians 19,20, we address the non-Gaussianity of common FC data by using a larger number of Gaussians (K̃ > K) and allowing multiple Gaussians to represent a single non-Gaussian cluster. In SWIFT, the probability density of X is approximated by fitting a K̃-component (K̃ ≥ K) GMM, and each density component f_j in Eq. 1 corresponds to a combination of one or more of these Gaussian components. Formally, the probability density is approximated as

f(x | Θ̃) = Σ_{i=1}^{K̃} α_i φ(x | µ_i, Σ_i),   (3)

where φ(x | µ_i, Σ_i) is the multivariate Gaussian distribution with mean µ_i and covariance matrix Σ_i, and α_i is the corresponding mixing coefficient. We seek to estimate the parameter vector of the GMM, Θ̃ = (α_1, …, α_K̃, µ_1, …, µ_K̃, Σ_1, …, Σ_K̃). After obtaining Θ̃, we combine Gaussian mixture components g_i to represent the mixture components f_j of the general mixture model. Specifically, if the j-th mixture component f_j is a combination of the l Gaussians with indices i_1, …, i_l, we obtain the parameters π_j = Σ_{m=1}^{l} α_{i_m} and f_j(x) = (1/π_j) Σ_{m=1}^{l} α_{i_m} φ(x | µ_{i_m}, Σ_{i_m}). Observe that the model in Eq. 3 represents a finite mixture model 17, where each individual mixture component of Eq. 1 is a combination of several Gaussian components. The number of Gaussians K̃ in Eq. 3 should be determined so as to provide an adequate approximation to the observed distributions.
Specifically, it should provide enough resolution to identify rare subpopulations commonly of interest in FC data analysis, where it is often desirable to resolve subpopulations comprising 0.1% or fewer of the total events in a "background" of other, larger subpopulations accounting for 10% or more of the total events. Intuitively, we expect that multimodal distributions do not correspond to a single subpopulation. These considerations motivated the SWIFT algorithm, which consists of three main phases shown schematically in Figure 1a: an initial GMM fitting using K0 components; a modality-based splitting stage that splits multimodal clusters and results in K̃ ≥ K0 Gaussian components in Eq. 3; and a final modality-preserving merging stage resulting in the K ≤ K̃ component general (not necessarily Gaussian) mixture model of Eq. 1, allowing representation of subpopulations with skewed but unimodal distributions as individual clusters. The individual phases are described in detail in the following subsections.
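The bookkeeping by which member Gaussians are combined into a merged, possibly non-Gaussian, component (π_j as the sum of member α's; f_j as their renormalized weighted sum) can be sketched as follows. This is an illustrative fragment with function names of our own choosing, not part of the SWIFT distribution.

```python
def merged_mixing_coefficient(alphas, indices):
    """pi_j for a merged cluster: the sum of the mixing coefficients
    alpha of its member Gaussian components."""
    return sum(alphas[i] for i in indices)

def merged_density_at_point(pdf_vals, alphas, indices):
    """Density f_j(x) of the merged component at a point x, given each
    member Gaussian's pdf value at x: the alpha-weighted sum of member
    densities, renormalized by pi_j so that f_j integrates to one."""
    pi_j = merged_mixing_coefficient(alphas, indices)
    return sum(alphas[i] * pdf_vals[i] for i in indices) / pi_j
```

For example, merging components with α = 0.2 and α = 0.5 yields a merged cluster of weight π_j = 0.7 whose density is the correspondingly weighted average of the two Gaussians.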
Figure 1

The SWIFT algorithm: (a) Overall workflow and (b) Weighted iterative sampling.


Scalable GMM Fitting Using Expectation Maximization

Traditionally, parameter estimation for GMMs is done using the Expectation Maximization (EM) algorithm 21, but the EM algorithm is computationally expensive for large FC datasets (e.g., tens of millions of events, ∼10^2 Gaussian components, and d > 20 dimensions). Each EM iteration requires O(N K0 d^2) operations and is therefore prohibitively slow. Moreover, FC datasets tend to show high dynamic ranges in subpopulation sizes, and the EM algorithm often fails to isolate small overlapping subpopulations because of its slow convergence rate. SWIFT's weighted iterative sampling addresses these twin challenges by scaling the EM algorithm to large datasets while allowing better detection of small subpopulations. The parameter estimates are then refined by performing a few iterations of the Incremental EM (IEM) 22 algorithm on the entire dataset. An optional ensemble clustering step improves the robustness of clustering in a scalable manner. To make the description self-contained, we present a brief overview of the EM and IEM algorithms in the context of GMM fitting in the Supporting Information (Section A).

Weighted iterative sampling based EM

Algorithm 1 and Figure 1b summarize the weighted iterative sampling based EM procedure used in SWIFT; its motivation and key steps are highlighted next. An intuitive way to reduce computational complexity for large datasets is to work on a smaller subsample drawn from the dataset. When the mixing coefficients α_j exhibit a high dynamic range, a uniform random sample drawn from the dataset usually represents the large subpopulations with reasonable fidelity but is inadequate for resolving rare populations, for which parameter estimation is markedly poor when operating on a uniform subsample. We start with a uniform random sample S containing n observations drawn from X. First, a K0-component GMM is fitted to S. Next, we fix the parameters of the p (a user-defined parameter) most populous Gaussians and reselect a sample of n observations from X, drawn according to a weighted distribution in which the probability of selecting a data point equals the probability that the data point does not belong to the already fixed clusters. Specifically, let F be the set of Gaussian components whose parameters have already been fixed and γ_ij be the posterior probability that x_i belongs to the j-th Gaussian component. Then, in the next iteration, we resample according to a weighted distribution where the probability of selecting each point x_i is proportional to 1 − Σ_{j∈F} γ_ij. The EM algorithm is applied on the new sample with random reinitialization of the Gaussian components that are not yet fixed (the means are set to randomly chosen observations from the new sample). In each E-step, we estimate posterior probabilities γ_ij for all K0 Gaussian components. In the M-step we re-estimate the parameters of the remaining components, excluding the already fixed ones. After each M-step, the mixing coefficients are normalized so that they sum to one.
As the algorithm proceeds, larger clusters get fixed and the weighted resampling favors selection of observations from smaller clusters, thereby improving the chances of discovering smaller subpopulations. The resampling and model-fitting steps alternate until all the cluster parameters are fixed. A visual demonstration of the weighted sampling method is shown in Figure 2. It can be seen (see Supporting Information, Section B) that under idealized conditions, when the observed data are indeed drawn from a GMM and the parameters and posteriors for the fixed clusters are correctly estimated, the weighted iterative sampling algorithm proposed here exhibits the correct behavior: the samples obtained with the weighted resampling are equivalent to samples drawn from a mixture model consisting of only the clusters that are not (so far) fixed, where the mixing coefficients remain proportional to their values in the original mixture but are renormalized to meet the unit-sum constraint. Furthermore, in the presence of a large dynamic range for the mixing coefficients, the weighted iterative sampling mitigates problems with convergence in the vicinity of the true parameters (Supporting Information, Section B). The weighted iterative sampling significantly reduces the computational complexity of each EM iteration from O(N K0 d^2) to O(n K0 d^2), where n is the sample size (n ≪ N).
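The resampling step above (selection probability proportional to 1 − Σ_{j∈F} γ_ij) can be sketched in a few lines of NumPy. This is an illustrative sketch of the sampling rule only, not the SWIFT implementation.

```python
import numpy as np

def weighted_resample(gamma, fixed, n, rng=None):
    """Draw the next working sample for the weighted iterative EM.

    gamma: (N, K0) posterior membership matrix over all components.
    fixed: indices of components whose parameters are already fixed.
    Each observation is selected with probability proportional to
    1 - (sum of its posteriors over the fixed components), so events
    explained by already-fixed, large clusters are down-weighted.
    Returns n sampled row indices into the dataset.
    """
    rng = np.random.default_rng(rng)
    w = 1.0 - gamma[:, fixed].sum(axis=1)
    w = np.clip(w, 0.0, None)          # guard against tiny negative round-off
    p = w / w.sum()
    return rng.choice(len(gamma), size=n, replace=True, p=p)
```

Observations that belong to a fixed cluster with posterior probability one are never resampled, so later EM rounds concentrate entirely on the remaining, smaller subpopulations.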
Figure 2

Weighted iterative sampling based Gaussian mixture model (GMM) clustering for better estimation of smaller subpopulations. Intermediate results along different stages of the algorithm and the final result are shown highlighting how smaller subpopulations are emphasized in the weighted iterative sampling process. [Color figure can be viewed in the online issue, which is available at http://wileyonlinelibrary.com.]

Algorithm 1: Weighted iterative sampling based EM in SWIFT

Input: X: sequence of N data vectors; K0: number of initial Gaussian mixture components; n: sample size; p: number of components to fix at a time.
Output: Θ̃: parameters of the initial Gaussian mixture model (GMM).

1. Obtain a set S of n random samples drawn from X.
2. Estimate GMM parameters Θ̃ using EM on S.
3. Estimate posterior probabilities γ_ij via an E-step on X using parameters Θ̃.
4. Let F be the set of Gaussian components whose parameters have been fixed. Initialize F ← ∅.
5. Repeat:
   (a) Determine F1 = {the p most populous Gaussian components of the current model not in F}.
   (b) Fix the parameters of the components in F1. Set F ← F ∪ F1.
   (c) Resample a set S of n observations from X with a weighted distribution where each observation x_i is selected with probability proportional to 1 − Σ_{j∈F} γ_ij.
   (d) Apply a modified EM algorithm on S that does not update the parameters of already fixed components; in the M-step, update only the components not in F.
   (e) Normalize the mixing probabilities computed in the M-step so that all K0 coefficients sum to one.
   (f) Perform a single E-step on X to recalculate the posteriors γ_ij.
   Until all the components are fixed.
6. Perform a few incremental EM iterations on X with the parameters of all K0 Gaussian components as initial parameters.
7. Return the parameters estimated in the previous step.

Incremental EM iterations

Upon completion of the weighted iterative sampling based EM procedure for GMM fitting, SWIFT performs a few (typically 10) EM iterations on the entire dataset to improve the fit, taking all of the data into account. However, even a few iterations on the entire dataset can be computationally expensive, particularly in terms of memory: the posterior probability matrix requires O(N K0) storage, which can be prohibitive for large datasets. Therefore, we use the memory-efficient IEM 22 (Supporting Information, Section A) for the iterations performed over the entire dataset. The IEM algorithm divides the data into multiple blocks and performs a partial E-step, one block at a time. For each block, the partial E-step estimates the sufficient statistics for the associated block, which are used in the subsequent M-step for updating parameters. IEM is memory-efficient because it processes only one block of data at a time. Moreover, IEM can exploit information from each data block earlier (without waiting for a full data scan), and thus can improve the speed of convergence for large datasets 23 when each block is sufficiently large.
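The block-wise accumulation of sufficient statistics described above can be sketched as follows. This is a simplified illustration (covariance updates are omitted, and the posteriors are supplied by a callable) rather than the IEM algorithm of Ref. 22 in full.

```python
import numpy as np

def iem_pass(X, resp_fn, K, block_size=10000):
    """One incremental-EM-style pass: process X block by block,
    accumulating per-component sufficient statistics instead of
    holding the full (N, K) posterior matrix in memory.

    resp_fn(block) -> (b, K) posteriors under the current parameters.
    Returns updated (weights, means) from the accumulated statistics;
    covariance updates are omitted for brevity.
    """
    N, d = X.shape
    s0 = np.zeros(K)          # sums of responsibilities per component
    s1 = np.zeros((K, d))     # responsibility-weighted sums of x
    for start in range(0, N, block_size):
        block = X[start:start + block_size]
        gamma = resp_fn(block)            # partial E-step on one block
        s0 += gamma.sum(axis=0)
        s1 += gamma.T @ block
    weights = s0 / N                      # M-step from the statistics
    means = s1 / s0[:, None]
    return weights, means
```

Only one block of posteriors exists in memory at any time, which is the source of IEM's memory efficiency.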

Multimodality Splitting

The initial GMM fitting may produce clusters that have several density maxima in the d-dimensional observation space. FC experts usually interpret each mode as a distinct subpopulation; therefore, SWIFT splits such multimodal clusters into unimodal subclusters. Algorithm 2 summarizes this multimodality splitting procedure. Let X_i be the set of observations associated with the i-th Gaussian cluster. SWIFT estimates one-dimensional kernel density functions for each of the d observation dimensions and the d principal components of X_i, where the optimal smoothing parameter for the kernel density estimation procedure is determined in a data-dependent manner using the normal optimal smoothing method 24. A cluster is identified as multimodal if any of these kernel density functions has more than one local maximum. If the i-th initial cluster is identified as multimodal, SWIFT fits a K-component GMM to X_i, where K is the smallest number of components such that each fitted subcomponent corresponds to a unimodal set of observations. To estimate K, SWIFT initiates GMM fitting with K = 2 and increases K ← K + 1 until each of the fitted subcomponents is unimodal. After performing splitting for all the initial multimodal clusters, we obtain a K̃-component GMM with refined parameters Θ̃, where K̃ ≥ K0. For small clusters, many small spurious modes often arise because there are not enough observations for reliable density estimation. Therefore, modes smaller than a chosen fraction t_small of the largest mode are ignored in estimating modality. Furthermore, each multimodal cluster is split into no more than Kmax components.
The upper bound Kmax is useful for background clusters that are too diverse and sparse and would otherwise require a very large number of components to render each component unimodal. In the GMM fitting procedure, SWIFT also identifies some clusters as "background clusters" through an automatic background detection technique that extends the method described in Ref. 9. Background clusters are identified by their low density and high volume, where the volume of a cluster is approximated by the determinant of its covariance matrix and its density is estimated as the ratio of its population size to its volume 9. SWIFT identifies a cluster as "background" if its density is less than the overall data density and its volume is larger than the mean cluster volume. The sparse background clusters are typically multimodal in many dimensions. Depending on the biological study, a user may or may not want to split these background clusters. Biologists interested in major populations do not need to analyze background clusters. However, in some biological studies (e.g., stem cells, peptide stimulation), it is crucial to identify biologically significant small subpopulations (fewer than 100 observations out of a total in the millions) that are assigned to background cluster(s). In such situations, these rare populations can be resolved by splitting the background cluster(s), an option that can be enabled in SWIFT via a user-defined input parameter. Often background clusters do not have large enough populations for reliable GMM fitting. To address this, SWIFT performs oversampling by replicating the observations in the background cluster with a small random perturbation and then performs splitting. This oversampling and background splitting operation is effective for finding rare subpopulations in large FC datasets. The multimodality splitting stage is the most computationally expensive step in the current SWIFT implementation.
Let Nm be the number of data points in the most populous multimodal cluster, Kmax the upper bound on the number of split clusters resulting from a single multimodal cluster, Km the number of such multimodal clusters, d the number of dimensions, and Tmax the maximum number of EM iterations allowed. Then the worst-case computational complexity of the modality splitting stage is O(Km Nm Kmax^2 d^2 Tmax).

Algorithm 2: Multimodality splitting in SWIFT

Input: X: input dataset; Θ0: parameters of the initial K0-component Gaussian mixture model; Kmax: upper bound on the number of Gaussians fit to an initial cluster.
Output: Θ̃: parameters of the refined Gaussian mixture model; K̃: refined number of Gaussians.

1. K̃ ← 0
2. For i = 1 to K0:
   (a) X_i ← set of observations in X associated with the i-th initial Gaussian cluster.
   (b) K ← 1
   (c) If X_i is multimodal: repeat K ← K + 1 and fit a K-component GMM to X_i, until K = Kmax or all the subclusters of X_i are unimodal.
   (d) K̃ ← K̃ + K
   (e) Update the parameters Θ̃ accordingly.
3. Return the final parameters Θ̃ and the final number of Gaussians K̃.
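The modality test at the heart of the splitting stage (count local maxima of a 1-D kernel density estimate, ignoring modes below a fraction t_small of the largest) can be sketched as follows. The normal-reference bandwidth rule stands in for the normal optimal smoothing method of Ref. 24; this is an illustrative sketch, not the SWIFT implementation.

```python
import numpy as np

def count_modes(values, t_small=0.1, grid=256):
    """Count modes of a 1-D sample via a Gaussian kernel density
    estimate evaluated on a grid, ignoring local maxima smaller than
    t_small times the largest mode (a guard against spurious modes
    in small clusters)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    h = 1.06 * x.std() * n ** (-0.2)      # normal-reference bandwidth
    g = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid)
    # Unnormalized KDE on the grid (normalization does not affect modes).
    dens = np.exp(-0.5 * ((g[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    peaks = [i for i in range(1, grid - 1)
             if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    if not peaks:
        return 1
    top = max(dens[i] for i in peaks)
    return sum(1 for i in peaks if dens[i] >= t_small * top)

def is_unimodal(values, **kw):
    return count_modes(values, **kw) == 1
```

In SWIFT this test is applied along each observation dimension and each principal component; a cluster is declared multimodal if any of these one-dimensional projections has more than one retained mode.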

LDA-Based Agglomerative Merging

The final step of SWIFT merges together Gaussian mixture components obtained from the GMM fitting and multimodality splitting stages, allowing representation of subpopulations with skewed but unimodal distributions. Merging mixture components to represent skewed subpopulations is well established in the clustering literature 9,11,12,20,25,26. We propose a novel agglomerative merging algorithm based on Fisher linear discriminant analysis (LDA) 27 that outperforms the previously proposed entropy-based merging method 26 in terms of both speed and accuracy (Supporting Information Fig. S7 and Table S1). The algorithm is explicitly motivated by the need to maintain distinct unimodal clusters in the observed datasets as distinct subpopulations. For a pair of clusters associated with two GMM components, LDA allows us to compute the one-dimensional projection of the d-dimensional data for which the separation between the clusters is maximized. Clusters for which the LDA projection is unimodal are also unimodal in the d-dimensional space and can therefore be merged without compromising unimodality. This intuition is the basis of the merging method described next. The GMM estimation procedure combined with the modality-based splitting process yields a set of K̃ Gaussian mixture components. For i = 1, 2, …, K̃, denoting the i-th Gaussian mixture component by g_i, we associate with it a corresponding cluster X_i comprising the subset of the observed data that the mixture model identifies as belonging to g_i. Our LDA merging algorithm successively merges pairs of Gaussians until no further merging is possible while maintaining unimodality of the associated cluster data points. For each pair of Gaussians (g_i, g_j), the symmetric KL divergence, defined as d(g_i, g_j) = KL(g_i ‖ g_j) + KL(g_j ‖ g_i), is computed, and the pairs are considered for merging in ascending order of this pairwise symmetric KL divergence.
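For Gaussian components the KL divergence has a well-known closed form, so the symmetric divergence used to order candidate pairs can be computed directly from the component parameters. The following sketch is illustrative, not the SWIFT implementation.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL divergence KL(N(m0, S0) || N(m1, S1))."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = np.asarray(m1) - np.asarray(m0)
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + logdet1 - logdet0)

def sym_kl(m0, S0, m1, S1):
    """Symmetric KL divergence used to order candidate merge pairs."""
    return kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0)
```

Identical components have zero symmetric divergence; nearby, heavily overlapping components have small values and are therefore considered for merging first.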
For a pair of Gaussians under consideration, using LDA on the corresponding pair of clusters X_i and X_j, we determine a unit-norm d × 1 vector w for which the separation between the clusters is maximized (on average) in the one-dimensional linear projections w^T x of the d-dimensional observations x in X_i and X_j. Specifically, w maximizes the ratio of the squared difference of projected means to the sum of the individual cluster variances 27. For each element in the combined set of observations from the two clusters, a corresponding LDA projection is then obtained. Modes (local maxima) in the 1D kernel density estimate of the projected data are then determined to test for unimodality of the LDA projection of the combined cluster. The combined cluster is also tested for unimodality along all its given dimensions and principal components. The class-wise dispersions σ_i and σ_j of the projected data for the individual clusters are also evaluated and their ratio is computed. The pair of Gaussians is merged if the following three conditions are met: (a) the LDA projection is unimodal, (b) the combined cluster is unimodal along the original data axes and the principal component directions, and (c) the dispersion ratio is less than a chosen threshold. The screening based on the dispersion ratio helps avoid merging a dense foreground cluster with a sparse background cluster. If a merge occurs, we proceed to the next iteration of agglomerative merging after computing the symmetric KL divergence of the merged cluster to the other Gaussians in the GMM. If, on the other hand, a merge does not occur because at least one of the three test conditions is violated, we move on to the next pair in the ascending symmetric KL divergence order. The merging algorithm continues until no mergeable pairs can be found. A sparse cluster may be subsumed by the tail of a dense cluster and may not appear as a separate mode even if the underlying distribution is multimodal.
We avoid this pitfall by performing the LDA-based modality check not only on the actual observations of the two Gaussian clusters g_i and g_j, but also on synthetic data points randomly sampled from the two Gaussians. By sampling an equal number of points from both components, issues related to imbalanced cluster densities are avoided. A naive implementation of the proposed LDA merging procedure requires O(K̃^2) LDA estimations in the worst case, resulting in O(K̃^2 Nm d^2) complexity, where Nm is the population size of the most populous cluster. We reduce the number of LDA estimations very significantly by filtering out Gaussian component pairs that have almost no overlap, because pairs of Gaussian components whose means differ by a large amount relative to their standard deviations (in the d-dimensional space) will be multimodal in their LDA projection and need not be considered as prospects for merging. Specifically, we approximate each Gaussian component g_i by a multidimensional ellipsoid centered at µ_i with dispersion determined by its covariance Σ_i, and estimate (multidimensional) rectangular bounding boxes for the ellipsoids. If the bounding boxes for two Gaussians do not intersect, then their associated ellipsoids cannot intersect, and the corresponding pair of Gaussians is considered nonoverlapping. Determining whether two rectangular boxes in d dimensions intersect requires only O(d) operations and is significantly faster than directly determining whether two d-dimensional ellipsoids intersect. A large number of candidate Gaussian pairs are eliminated from consideration by this efficient bounding-box based filtering, and LDA estimation is required only for the remaining pairs. Moreover, at each merging step, the LDA-based modality criterion needs to be recomputed only for the merged cluster produced in the previous merging step; values for the other cluster pairs computed previously are reused, saving computation.
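The O(d) bounding-box filter can be sketched as follows: the axis-aligned box enclosing an n-sigma ellipsoid of a Gaussian has half-width n_sigma · sqrt(Σ_kk) along axis k, and two boxes intersect iff they overlap on every axis. The choice of n_sigma = 3 here is illustrative; the text does not specify the scaling SWIFT uses.

```python
import numpy as np

def bounding_box(mean, cov, n_sigma=3.0):
    """Axis-aligned box enclosing the n_sigma ellipsoid of a Gaussian.
    The half-width along axis k is n_sigma * sqrt(cov[k, k])."""
    half = n_sigma * np.sqrt(np.diag(cov))
    return np.asarray(mean) - half, np.asarray(mean) + half

def boxes_intersect(box_a, box_b):
    """O(d) overlap test: boxes intersect iff they overlap on every
    axis. If the boxes are disjoint, the enclosed ellipsoids are
    disjoint and the pair can be skipped before any LDA computation."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    return bool(np.all(lo_a <= hi_b) and np.all(lo_b <= hi_a))
```

Pairs whose boxes fail this test are discarded outright, which is what eliminates most of the O(K̃^2) candidate pairs in practice.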
Algorithm 3 summarizes the LDA based merging step used in SWIFT and Figure 3 presents a visualization of the operations in the algorithm using a sample 2-D dataset.
Figure 3

Cluster merging in SWIFT illustrated via a 2D example: (a) Four original skewed subpopulations, (b) Initial GMM fit, (c) Potential pairs considered for merging, the bounding box filtering introduced for computational efficiency eliminates all pairs except 1,2 and 5,6, and (d) Resulting clusters after merging. Note that in the final result, the original skewed and non-Gaussian subpopulations are well-represented via the merged clusters formed from combining initially fit Gaussians. [Color figure can be viewed in the online issue, which is available at http://wileyonlinelibrary.com.]

Algorithm 3: LDA-based agglomerative merging in SWIFT

Input: X: input dataset; Θ̃: parameters of the K̃-component Gaussian mixture model.
Output: Θ: parameters of the combined mixture model; K: final number of clusters.

1. Initialize K ← K̃.
2. Repeat:
   (a) For i = 1 to K: E_i ← the ellipsoid with center µ_i and dispersion determined by Σ_i.
   (b) For each pair (i, j) with i < j: B_i ← the smallest bounding box covering E_i; B_j ← the smallest bounding box covering E_j. If B_i and B_j do not intersect, eliminate the pair (i, j) from consideration.
   (c) Estimate the pairwise symmetric KL divergences among the Gaussian components in the current model. // See text for full details of the unimodality test; the following version is abbreviated.
   (d) For each remaining pair (i, j), ordered by ascending symmetric KL divergence d(g_i, g_j):
       i. Y_i ← set of observations sampled from g_i; Y_j ← set of observations sampled from g_j.
       ii. w ← LDA(Y_i, Y_j); σ_i ← standard deviation of the projections w^T Y_i; σ_j ← standard deviation of w^T Y_j.
       iii. If the LDA projection is unimodal, the combined cluster is unimodal, and the dispersion-ratio test passes: merge g_i and g_j, update the model, and break.
   Until no more merging is possible.
3. Return the final parameters Θ and the final number of clusters K.
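The two per-pair quantities used in step 2(d), the Fisher LDA direction w and the class-wise dispersion ratio, can be sketched as follows. This is an illustrative two-class LDA via the within-class scatter matrix, not the SWIFT implementation.

```python
import numpy as np

def lda_direction(Xa, Xb):
    """Unit-norm Fisher LDA direction for two clusters: w maximizes the
    ratio of the squared projected-mean difference to the summed
    within-cluster variance, i.e. w is proportional to Sw^{-1}(mb - ma)."""
    ma, mb = Xa.mean(axis=0), Xb.mean(axis=0)
    Sw = np.cov(Xa, rowvar=False) + np.cov(Xb, rowvar=False)
    w = np.linalg.solve(Sw, mb - ma)
    return w / np.linalg.norm(w)

def dispersion_ratio(w, Xa, Xb):
    """Ratio of larger to smaller projected dispersion; screening on
    this ratio avoids merging a dense cluster with a sparse one."""
    sa, sb = (Xa @ w).std(), (Xb @ w).std()
    return max(sa, sb) / min(sa, sb)
```

For two well-separated clusters with similar covariances, w points essentially along the line between their means, and the dispersion ratio is close to one.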

Results

For proper evaluation and validation of any clustering algorithm, one needs reliable ground truth data. To address this challenge, one can use either simulated data, or electronically mixed data. In this article, we report on experiments for evaluating SWIFT using both approaches. Detailed evaluation of SWIFT for a biologically relevant analysis is presented in the companion article 16.

Results on Simulated Data

In this section, using simulated mixtures of Gaussians, we evaluate SWIFT’s scalability and its capability for detecting rare populations, and compare these against the traditional EM algorithm. The main reasons for using simulated data are twofold. First, we know the full ground truth for each cluster in simulated data. Second, the traditional EM algorithm is prohibitively slow for actual large, high-dimensional FC datasets, making a direct comparison on actual FC data prohibitively time consuming (or impossible to complete using the computational hardware we use for SWIFT). A synthetic two-dimensional Gaussian mixture with six components (shown in Fig. 4) was generated, where the component sizes (mixing coefficients up to normalization) were chosen as 1×10^6, 7.5×10^5, 1.9×10^5, 5×10^4, 1×10^4, and 2×10^3 (see Table 1), representative of situations with the large dynamic range that is of primary interest to us. For this dataset, GMM parameters were estimated using both the traditional EM algorithm and SWIFT’s weighted iterative sampling based EM algorithm, with the number of Gaussians K0 set to 6 in both cases. The sample size for the weighted sampling was chosen as n = 20,000.
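A toy illustration of why weighted sampling preserves rare clusters follows. This is a scaled-down caricature with invented centers and counts, not the exact SWIFT sampling scheme, which iterates and derives weights from model responsibilities rather than known labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down cluster sizes spanning a large dynamic range (the centers and
# counts here are invented for illustration).
sizes = np.array([10_000, 7_500, 1_900, 500, 100, 20])
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6], [3, 3], [9, 3]], float)

labels = np.repeat(np.arange(6), sizes)
data = centers[labels] + rng.normal(scale=0.5, size=(labels.size, 2))

# Under uniform subsampling of 600 events the smallest cluster contributes
# less than one event in expectation, so EM easily loses it.
uniform_expected = 600 * sizes[-1] / labels.size

# Weighted sampling: weight each event inversely to its cluster's size, so
# small clusters retain enough events in the subsample to be modeled.
weights = 1.0 / sizes[labels].astype(float)
idx = rng.choice(labels.size, size=600, replace=False, p=weights / weights.sum())
kept = int(np.sum(labels[idx] == 5))  # nearly all rare-cluster events survive
```

With inverse-size weights every cluster contributes roughly equally to the subsample, so the rare component stays visible to the subsequent EM fit.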
Figure 4

Comparison of weighted sampling based EM and the traditional EM algorithm on a synthetic mixture of 6 Gaussians: (a) Original dataset, (b) GMM estimate from the weighted sampling based EM used in SWIFT, and (c) GMM estimate from traditional EM algorithm. Note that smallest subpopulation is missed by the traditional EM algorithm but is represented with good accuracy by the weighted sampling based EM used in SWIFT. [Color figure can be viewed in the online issue, which is available at http://wileyonlinelibrary.com.]

For quantitative evaluation of clustering accuracy, we estimate the error by computing the symmetric Kullback–Leibler (KL) divergence between each estimated Gaussian parameter and the associated true Gaussian parameter, where the correspondence between estimated and true Gaussians is first determined by a weighted bipartite graph matching 28 (also using the symmetric KL divergence as the matching cost). For each cluster, the error in the estimated parameters is computed as the symmetric KL divergence between the estimated parameters and the true parameters of the matching Gaussian determined by the bipartite matching. An overall error is also computed as the sum of the errors over all six clusters. Since the EM algorithm only assures convergence to a local optimum, we performed 10 repeated runs of EM with random initializations and chose the run with the maximum log-likelihood. To ensure the estimates are statistically meaningful, we performed the same experiment (EM fitting with 10 repetitions) 10 times and then estimated the average runtime, the total error, and the error associated with the smallest cluster. The results are presented in Table 1 and shown in Figure 4 for a typical EM run. The weighted iterative sampling based EM is nearly 18 times faster and estimates the parameters of the smallest cluster with significantly greater accuracy than the traditional EM algorithm, which performs rather poorly.
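The evaluation metric can be sketched as follows. This is an illustrative reconstruction: brute-force enumeration of permutations stands in for a general weighted bipartite matcher, which is adequate for a handful of clusters.

```python
import itertools
import numpy as np

def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetric KL divergence between two multivariate Gaussians."""
    d = len(mu1)
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    diff = mu1 - mu2
    return 0.5 * (np.trace(inv2 @ cov1) + np.trace(inv1 @ cov2)
                  + diff @ (inv1 + inv2) @ diff - 2 * d)

def cluster_errors(true_params, est_params):
    """Match estimated Gaussians to true ones by minimizing the total
    symmetric KL divergence, then report the per-cluster divergences."""
    K = len(true_params)
    cost = np.array([[symmetric_kl(*t, *e) for e in est_params]
                     for t in true_params])
    perm = min(itertools.permutations(range(K)),
               key=lambda p: sum(cost[i, p[i]] for i in range(K)))
    return list(perm), [float(cost[i, perm[i]]) for i in range(K)]

# Sanity check: a shuffled copy of the true parameters is matched exactly.
true = [(np.zeros(2), np.eye(2)), (np.full(2, 5.0), np.eye(2))]
est = [true[1], true[0]]
perm, errs = cluster_errors(true, est)
# perm == [1, 0] and each per-cluster error is ~0
```

The overall error reported in the tables corresponds to `sum(errs)` under this construction.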
The poor performance of the traditional EM is due to: (a) the slow convergence of EM in the presence of overlapping and small clusters (see Supporting Information, Section B), and (b) the convergence of EM to poor local optima depending on the random initialization. The results clearly illustrate the advantages of weighted iterative sampling for large datasets with a high dynamic range in mixing coefficients. The weighted iterative sampling also provides a significant computational benefit. For a typical d = 17 dimensional FC dataset with N = 1.5 million events, a pure IEM approach for the initial mixture modeling phase, without the weighted iterative sampling in SWIFT and with an IEM block size of 50,000, increases the computational time by a factor of 10.53 and the memory requirement by a factor of 1.8 (reported data are for an 8-core 2.4 GHz Mac workstation), while providing results comparable with the traditional EM, where the smaller clusters are frequently overwhelmed by larger clusters, though this can often be remedied by the subsequent splitting and merging stages of SWIFT.
Table 1

Comparison of the weighted iterative sampling based EM against the traditional EM for a synthetic two-dimensional Gaussian mixture with mixing coefficients 1×10^6, 7.5×10^5, 1.9×10^5, 5×10^4, 1×10^4, and 2×10^3, chosen to be representative of the high dynamic range encountered in rare population detection

                                     Weighted iterative sampling   Traditional EM
Avg runtime (s)                      134.1                         2414.1
Avg cumulative error                 0.0157                        37.687
Avg error for the smallest cluster   0.0012                        34.3397

Listed error values correspond to symmetric KL divergences averaged over 10 independent runs. See text for details.

Although the above example explored a large dynamic range, typical dynamic ranges for FC data are even larger. In the above example, the smallest cluster had 2,000 points out of a total of 2 million, whereas actual FC datasets often have biologically significant subpopulations with fewer than a hundred cells in a sample of 2 million cells. We therefore also evaluated the performance of the weighted iterative sampling based EM as the size of the smallest cluster is further reduced; specifically, we generated 5 mixtures, where the smallest cluster sizes were set to 1500, 1000, 500, 200, and 100, respectively, and the remaining clusters were left unchanged from the previous example. The results, summarized in Table 2, indicate that SWIFT’s weighted iterative sampling works well until the smallest cluster is reduced to 200 points out of a total of 2 million. Results incorporating the additional stages (split and merge) in SWIFT, also included in the table, show that these additional steps further improve SWIFT’s capability to detect small clusters.
Table 2

Performance of the weighted iterative sampling based EM and the overall SWIFT (weighted sampling + split + merge) for small cluster detection in a total population size of 2 million events.

                        Weighted sampling                        Weighted sampling + split + merge
Smallest cluster size   Avg total error   Smallest cluster error   Avg total error   Smallest cluster error
1500                    0.0159            0.0019                   0.1020            0.0003
1000                    0.0128            0.0128                   0.0198            0.0046
500                     0.0220            0.0220                   0.0751            0.0044
200                     23.3622           23.3622                  1.7141            1.4561
100                     27.4113           27.0221                  7.1430            6.7043

Listed error values correspond to symmetric KL divergences averaged over 10 independent runs. See text for details.


Results on Flow Cytometry Data

A key challenge in validation on actual FC data is the scarcity of datasets with ground truth. Visual identification of populations via manual gating is hardly a gold standard, because of several limitations. First, gating is usually focused, rather than exhaustive, and is not suitable for validation of all clusters. Second, the gating procedure cannot exploit high-dimensional features and is also less accurate in the presence of cluster overlap. Third, the subjectivity of gating is well known to contribute to the variability of FC analysis results 29. Therefore, an objective validation is desirable. The Rochester Human Immunology Center generated a pair of datasets for which ground truth labels can be applied: one consisted of human peripheral blood cells, and the other consisted of mouse splenocytes. Both human and mouse cells were stained with the same set of fluorescently labeled antibodies (directed against homologous proteins in both species) such that half of the antibodies were human-specific and the rest were mouse-specific. Human cells bind only the anti-human antibodies, exhibiting high signal for a subset of the human antibodies and low signal for all the mouse antibodies; the mouse cells exhibit the opposite behavior. FC data were acquired for both samples using an LSR II cytometer (BD Immunocytometry Systems). The datasets have been made available on the FlowRepository server 30 for use by other researchers in testing FC data analysis algorithms. We electronically mixed these two datasets (544,000 observations in total, 21 dimensions) to create a series of hybrid datasets containing both human and mouse cells, where the label for each cell (either human or mouse) is known because of the electronic mixing. SWIFT was used for clustering each electronic mixture without using the human/mouse label in the clustering process.
An ideal clustering solution should resolve the distinction between human and mouse groups and produce clusters that contain either only human cells, or only mouse cells, but not both. We note here that the dataset and the evaluation task are explicitly designed to allow validation against known ground truth, which makes them atypical of common FC analysis tasks. A companion article 16 uses datasets and tasks that are typical of a substantial field of immune response evaluation and provides information on the validation of SWIFT’s ability to find rare clusters, and also to find clusters that are biologically significant. The initial Gaussian mixture model fitting was done with K0 = 80 Gaussian components. After the initial clustering, SWIFT’s multimodality splitting resulted in 148 Gaussians, and its LDA-based agglomerative merging resulted in 122 final clusters. Each of these 122 clusters was classified as either human or mouse by a majority decision rule. Figure 5a shows the actual number of human and mouse cells per cluster. Figure 5b shows the fractional proportion. Almost all the clusters are well-resolved as either only human or only mouse.
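The majority decision rule can be sketched as below; the toy arrays are hypothetical, with `species` coding 1 for human and 0 for mouse.

```python
import numpy as np

def label_clusters_by_majority(cluster_ids, species, n_clusters):
    """Label each cluster by the majority species of its events, and report
    its purity (fraction of events belonging to the majority species)."""
    labels = np.empty(n_clusters, dtype=int)
    purity = np.empty(n_clusters)
    for k in range(n_clusters):
        members = species[cluster_ids == k]
        n_human = int(np.sum(members == 1))
        labels[k] = int(n_human >= members.size - n_human)
        purity[k] = max(n_human, members.size - n_human) / members.size
    return labels, purity

# Toy mixture: cluster 0 mostly human, cluster 1 purely mouse.
cluster_ids = np.array([0, 0, 0, 0, 1, 1, 1])
species     = np.array([1, 1, 1, 0, 0, 0, 0])
labels, purity = label_clusters_by_majority(cluster_ids, species, 2)
# labels -> [1, 0]; purity -> [0.75, 1.0]
```

A purity near 1.0 for every cluster, as in Figure 5b, is what a well-resolved clustering of the electronic mixture should produce.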
Figure 5

Results from SWIFT clustering of the known-ground-truth, electronically mixed, human-mouse dataset. SWIFT yields 122 clusters that clearly separate the human vs. mouse cells: most clusters are comprised of entirely human or entirely mouse cells. See text and caption for Supporting Information Fig. S.12 for details of the dataset. [Color figure can be viewed in the online issue, which is available at http://wileyonlinelibrary.com.]

To evaluate SWIFT’s rare population detection via sensitivity analysis, we electronically mixed varying proportions of human and mouse cells and observed how performance varied as the proportion of human cells decreased: 50%, 25%, 10%, 1%, and 0.1%. In this experiment, we benchmarked detection of the human clusters as the proportion of human cells decreases; accordingly, precision is the fraction of cells in the human-labeled clusters that are truly human, and recall is the fraction of all human cells that fall in the human-labeled clusters. The results (Table 3) show that SWIFT can resolve human cells down to a 1% proportion with high precision and recall. For the 0.1% case, SWIFT correctly identified 2 human clusters with high recall, but the precision is relatively low (68.40%) because these human clusters also included quite a few mouse cells. For this dataset, we also compared SWIFT against FLOCK 5. FLOCK also resolves this simple dataset, but with greater overlap (results shown in Supporting Information, Fig. S.12).
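One reading of the precision and recall used in Table 3 can be sketched as follows; the exact formulas appear only in the original article, so treat this as an assumed but standard formulation.

```python
import numpy as np

def human_precision_recall(cluster_ids, species, human_clusters):
    """Precision: fraction of events in human-labeled clusters that are human.
    Recall: fraction of all human events captured by human-labeled clusters."""
    in_human_cluster = np.isin(cluster_ids, human_clusters)
    tp = np.sum(in_human_cluster & (species == 1))
    precision = tp / np.sum(in_human_cluster)
    recall = tp / np.sum(species == 1)
    return float(precision), float(recall)

# Toy check: 3 of the 4 events in the human-labeled cluster are human, and
# all 3 human events are captured by it.
p, r = human_precision_recall(
    np.array([0, 0, 0, 0, 1, 1]),   # cluster assignment per event
    np.array([1, 1, 1, 0, 0, 0]),   # 1 = human, 0 = mouse
    human_clusters=[0])
# p == 0.75, r == 1.0
```

Under this reading, the 0.1% row of Table 3 corresponds to high recall (nearly all human cells land in the 2 human clusters) but lower precision (those clusters also contain mouse cells).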
Table 3

Performance of SWIFT with varying proportion of human and mouse cells

Percentage of human cells (%)   Precision (%)   Recall (%)   Human clusters
50                              99.59           99.93        49
25                              99.62           99.83        33
10                              99.43           95.90        21
1                               91.82           99.34        11
0.1                             68.40           99.48        2

Discussion

SWIFT incorporates several novel components to address the challenges arising in FC. All three stages of SWIFT are motivated by two major requirements: scalability to large datasets and identification of rare populations. All major components of SWIFT (weighted iterative sampling, the incremental EM iterations, and efficient LDA-based merging) are designed to scale efficiently to big datasets, providing a significant improvement over existing soft clustering methods 9–12,14. SWIFT identifies rare populations using weighted iterative sampling and multimodality splitting; the multimodality splitting stage plays a critical role in rare subpopulation identification. SWIFT can also represent skewed clusters via LDA-based agglomerative merging, which reduces the number of clusters while preserving distinct unimodal populations. The interplay between multimodality splitting and merging yields a reasonable number of clusters, uses a sensible heuristic (the modality of clusters), and is more intuitive than the knee point in BIC or entropy plots used previously 10,11. Finally, the soft clustering used in SWIFT is useful for comprehending overlapping clusters (Supporting Information, Section H), compared with alternative hard clustering methods such as k-means 31 or spectral clustering 6. SWIFT is partly similar to flowPeaks 13 in that both rely on a unimodality criterion; however, flowPeaks aims for major peaks only (it has no modality splitting stage) and tends to miss small overlapping clusters. The significance of modal regions in identifying interesting subpopulations has also motivated curvHDR 32, where high-curvature regions are used to identify the modal regions, which are then exploited for (partly) automating gating.
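As a caricature of the modality heuristic (not the statistical unimodality test used by SWIFT), one can flag a projected cluster as multimodal when its histogram has a deep interior valley; the bin count, smoothing, and thresholds below are assumed parameters.

```python
import numpy as np

def looks_unimodal(x, bins=32, tol=0.7):
    """Crude 1-D modality heuristic: report multimodality when some interior
    histogram bin falls well below the peaks on both of its sides."""
    h, _ = np.histogram(x, bins=bins)
    h = np.convolve(h, np.ones(3) / 3.0, mode="same")  # light smoothing
    deepest = 1.0
    for i in range(1, len(h) - 1):
        flank = min(h[:i].max(), h[i + 1:].max())
        if flank > 0.1 * h.max():          # ignore noise in the far tails
            deepest = min(deepest, h[i] / flank)
    return deepest > tol

# Deterministic shapes: a triangular (unimodal) and a twin-peaked sample.
x_uni = np.repeat(np.arange(10.0), [1, 3, 6, 10, 14, 14, 10, 6, 3, 1])
x_bi  = np.repeat(np.arange(10.0), [10, 6, 2, 1, 1, 1, 1, 2, 6, 10])
# looks_unimodal(x_uni, bins=10) -> True; looks_unimodal(x_bi, bins=10) -> False
```

A split stage keeps dividing clusters that fail such a check; the merge stage joins a pair only when the combined projection still passes it, which is how the two stages balance each other.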
A recent article 14 describes an alternative approach to rare population detection and provides a point of reference for comparing SWIFT against the current state of the art in FC data analysis methods designed specifically for rare population identification. In 14, FC data are modeled as a hierarchical Dirichlet process Gaussian mixture model (HDPGMM) to solve the dual problems of finding rare events potentially masked by nearby large populations and providing alignment of cell subsets over multiple data samples. The HDPGMM is shown to identify biologically relevant subpopulations occurring at frequencies of 0.01–0.1% of the entire dataset, and the method is shown to be superior at finding rare populations as compared with manual gating (using a panel of 10 people), FLAME 12, FLOCK 33 (albeit indirectly), and flowClust 34. These comparisons were done with a 3-color (five-dimensional) FCS 2.0 (FACSCalibur) dataset of around 50,000 events. In our companion manuscript 16, we demonstrate that SWIFT handles much larger datasets (tens of millions of events with 17 independent dimensions) and identifies cell subpopulations at extremely low frequencies in 17-dimensional FC datasets of up to 25 million events, which is significantly more sensitive than the current state of the art. A direct comparison of SWIFT against other existing FC data analysis methods is stymied by the fact that most existing methods do not scale to the extremely large datasets we are exploring, nor are they designed to detect rare populations at the level of sensitivity targeted by SWIFT. These claims are supported by benchmarking results on smaller datasets that we report in the supporting information accompanying our companion manuscript 16. The weighted iterative sampling is one of the key contributions of SWIFT. Most of the existing scalable EM variants 35,36 do not specifically address the challenge of rare population detection.
Moreover, some assumptions of these methods are quite restrictive. For example, the scalable EM (SEM) 35 algorithm requires the covariance matrix to be diagonal, and the multistage EM 36 assumes all the clusters to share the same covariance matrix. These assumptions are too restrictive for FC data. SWIFT provides sufficient flexibility by allowing full covariance matrices for each individual Gaussian and performs well in the presence of rare populations. Although we implemented the weighted iterative sampling for mixture of Gaussians only, the method is general enough and can be extended to other soft clustering methods (e.g., mixture of t distributions, mixture of skewed t distributions, fuzzy c-means, etc.). The LDA-based agglomerative merging combined with a pruning process allows efficient and robust merging of Gaussian mixture components. The efficiency of the LDA-based agglomerative merging carries over to other applications where the number of observations and the number of clusters are much larger than the number of dimensions. Unlike the entropy-based merging, our LDA criterion is insensitive to relative cluster population sizes (see Supporting Information, Section E and Fig. S.7), and is guided by the modality criterion.
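A minimal sketch of the Fisher/LDA projection underlying the merge test is given below; the two-class closed form is an assumption about the formulation, and the example clusters are invented.

```python
import numpy as np

def lda_direction(x1, x2):
    """Fisher discriminant direction for two clusters, w ~ Sw^-1 (m1 - m2):
    the 1-D axis along which the pair is best separated."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Pooled within-class scatter matrix (covariances scaled back to scatter).
    sw = (np.cov(x1, rowvar=False) * (len(x1) - 1)
          + np.cov(x2, rowvar=False) * (len(x2) - 1))
    w = np.linalg.solve(sw, m1 - m2)
    return w / np.linalg.norm(w)

# Two clusters separated along the first axis; the LDA axis recovers it.
base = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
w = lda_direction(base, base + [5.0, 0.0])
# |w| is approximately [1, 0]
```

Projecting both clusters onto `w` reduces the merge decision to a 1-D unimodality question, which is what keeps the agglomerative merging cheap even in 17 or more dimensions and, unlike an entropy criterion, independent of the relative cluster sizes.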

Conclusion

This article presents the algorithm design for SWIFT (Scalable Weighted Iterative Flow-clustering Technique). SWIFT uses a three-stage workflow consisting of iterative weighted sampling, multimodality splitting, and unimodality-preserving merging to scale model-based clustering analysis to the large, high-dimensional datasets common in modern FC, while retaining resolution of subpopulations with rather small relative sizes, populations that are often biologically significant. Evaluations on synthetic datasets demonstrate that SWIFT offers improvements over conventional model-based approaches in scaling to large datasets and in resolving small populations. In the companion manuscript 16, SWIFT is applied to a task typical of immune response evaluation, demonstrating both scaling to very large FC datasets (tens of millions of events) and the capability to identify extremely rare populations. SWIFT is available for download at http://www.ece.rochester.edu/projects/siplab/Software/SWIFT.html.
References (18 in total; first 10 shown)

Review 1.  Seventeen-colour flow cytometry: unravelling the immune system.

Authors:  Stephen P Perfetto; Pratip K Chattopadhyay; Mario Roederer
Journal:  Nat Rev Immunol       Date:  2004-08       Impact factor: 53.106

2.  Rapid cell population identification in flow cytometry data.

Authors:  Nima Aghaeepour; Radina Nikolic; Holger H Hoos; Ryan R Brinkman
Journal:  Cytometry A       Date:  2011-01       Impact factor: 4.355

Review 3.  The flow of cytometry into systems biology.

Authors:  John P Nolan; Loretta Yang
Journal:  Brief Funct Genomic Proteomic       Date:  2007-07-04

4.  Automated gating of flow cytometry data via robust model-based clustering.

Authors:  Kenneth Lo; Ryan Remy Brinkman; Raphael Gottardo
Journal:  Cytometry A       Date:  2008-04       Impact factor: 4.355

5.  Automated high-dimensional flow cytometric data analysis.

Authors:  Saumyadipta Pyne; Xinli Hu; Kui Wang; Elizabeth Rossin; Tsung-I Lin; Lisa M Maier; Clare Baecher-Allan; Geoffrey J McLachlan; Pablo Tamayo; David A Hafler; Philip L De Jager; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2009-05-14       Impact factor: 11.205

6.  Combining Mixture Components for Clustering.

Authors:  Jean-Patrick Baudry; Adrian E Raftery; Gilles Celeux; Kenneth Lo; Raphaël Gottardo
Journal:  J Comput Graph Stat       Date:  2010-06-01       Impact factor: 2.302

7.  Data reduction for spectral clustering to analyze high throughput flow cytometry data.

Authors:  Habil Zare; Parisa Shooshtari; Arvind Gupta; Ryan R Brinkman
Journal:  BMC Bioinformatics       Date:  2010-07-28       Impact factor: 3.169

8.  Standardization of cytokine flow cytometry assays.

Authors:  Holden T Maecker; Aline Rinfret; Patricia D'Souza; Janice Darden; Eva Roig; Claire Landry; Peter Hayes; Josephine Birungi; Omu Anzala; Miguel Garcia; Alexandre Harari; Ian Frank; Ruth Baydo; Megan Baker; Jennifer Holbrook; Janet Ottinger; Laurie Lamoreaux; C Lorrie Epling; Elizabeth Sinclair; Maria A Suni; Kara Punt; Sandra Calarota; Sophia El-Bahi; Gailet Alter; Hazel Maila; Ellen Kuta; Josephine Cox; Clive Gray; Marcus Altfeld; Nolwenn Nougarede; Jean Boyer; Lynda Tussey; Timothy Tobery; Barry Bredt; Mario Roederer; Richard Koup; Vernon C Maino; Kent Weinhold; Giuseppe Pantaleo; Jill Gilmour; Helen Horton; Rafick P Sekaly
Journal:  BMC Immunol       Date:  2005-06-24       Impact factor: 3.615

9.  flowClust: a Bioconductor package for automated gating of flow cytometry data.

Authors:  Kenneth Lo; Florian Hahne; Ryan R Brinkman; Raphael Gottardo
Journal:  BMC Bioinformatics       Date:  2009-05-14       Impact factor: 3.169

10.  Hierarchical modeling for rare event detection and cell subset alignment across flow cytometry samples.

Authors:  Andrew Cron; Cécile Gouttefangeas; Jacob Frelinger; Lin Lin; Satwinder K Singh; Cedrik M Britten; Marij J P Welters; Sjoerd H van der Burg; Mike West; Cliburn Chan
Journal:  PLoS Comput Biol       Date:  2013-07-11       Impact factor: 4.475

Axel Ronald Schulz; Sebastian R Schulz; Cristiano Scottá; Daniel Scott-Algara; David P Sester; T Vincent Shankey; Bruno Silva-Santos; Anna Katharina Simon; Katarzyna M Sitnik; Silvano Sozzani; Daniel E Speiser; Josef Spidlen; Anders Stahlberg; Alan M Stall; Natalie Stanley; Regina Stark; Christina Stehle; Tobit Steinmetz; Hannes Stockinger; Yousuke Takahama; Kiyoshi Takeda; Leonard Tan; Attila Tárnok; Gisa Tiegs; Gergely Toldi; Julia Tornack; Elisabetta Traggiai; Mohamed Trebak; Timothy I M Tree; Joe Trotter; John Trowsdale; Maria Tsoumakidou; Henning Ulrich; Sophia Urbanczyk; Willem van de Veen; Maries van den Broek; Edwin van der Pol; Sofie Van Gassen; Gert Van Isterdael; René A W van Lier; Marc Veldhoen; Salvador Vento-Asturias; Paulo Vieira; David Voehringer; Hans-Dieter Volk; Anouk von Borstel; Konrad von Volkmann; Ari Waisman; Rachael V Walker; Paul K Wallace; Sa A Wang; Xin M Wang; Michael D Ward; Kirsten A Ward-Hartstonge; Klaus Warnatz; Gary Warnes; Sarah Warth; Claudia Waskow; James V Watson; Carsten Watzl; Leonie Wegener; Thomas Weisenburger; Annika Wiedemann; Jürgen Wienands; Anneke Wilharm; Robert John Wilkinson; Gerald Willimsky; James B Wing; Rieke Winkelmann; Thomas H Winkler; Oliver F Wirz; Alicia Wong; Peter Wurst; Jennie H M Yang; Juhao Yang; Maria Yazdanbakhsh; Liping Yu; Alice Yue; Hanlin Zhang; Yi Zhao; Susanne Maria Ziegler; Christina Zielinski; Jakob Zimmermann; Arturo Zychlinsky
Journal:  Eur J Immunol       Date:  2019-10       Impact factor: 6.688

