Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor's SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while allowing also for batch correction. This pipeline is based on open source R packages and can be easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology. Copyright:
Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor's SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while allowing also for batch correction. This pipeline is based on open source R packages and can be easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology. Copyright:
Over the past decade, mass cytometry (CyTOF) has advanced our understanding of a wide range of cellular processes, particularly in the field of immunology and tumor biology
, by enabling the simultaneous measurement of 40+ parameters at the single cell level. Currently, mass cytometry is transitioning from an exploratory research approach toward a diagnostic tool used in clinical laboratories and this transition is associated with an increased need for standardization
. Various studies have already suggested improvements on the experimental workflows to increase the robustness of mass cytometry data by working with frozen antibody cocktails or by including shared reference samples in each independent experiment to enable for batch correction
. Similarly, advanced downstream analyses benefit from the large number of analysis tools and algorithms implemented in R, which allow for fully reproducible analyses
.Between data generation and downstream data analysis, data preprocessing is an multi-step procedure required to convert raw FCS files into data objects that can be input to downstream statistical analysis and visualization
. Upon data collection, the first step consists in concatenating files from sequential CyTOF acquisitions and removing events with unstable signal, which are usually caused by uneven flow rate or introduction of air in the fluidic system. As a second step, CyTOF data need to be corrected for time dependent signal drift, which is mostly due to cone contamination, mass calibration drift or loss of detector sensitivity over time. This correction is performed by acquiring metal tagged polystyrene beads together with the cell suspension, where bead signals can be used as a reference to normalize the cell signals
. Another potential artefact in CyTOF data is due to signal spillover between channels. Although lower than what is usually observed in fluorescent flow cytometry, spillover in mass cytometry can still account for up to 4% of the signal in some channels and needs to be corrected using signal compensation
. Sample barcoding prior to staining is a common approach used in mass cytometry to combine multiple samples in a single experiment to minimize experimental variation due to staining and CyTOF acquisition. In this case, individual cells have to be assigned to their respective sample via a process called single cell debarcoding
. In large studies where samples are collected over a long period of time by different users, on different machines or at different sites, an important step is to correct for batch effects, which can be achieved by including a shared control sample in each independent batch
. Finally, only live, intact single cells are relevant for the downstream analysis. Beads, doublets, debris and dead cells are excluded by gating on scatter plots
.Each step of the preprocessing pipeline requires expert decisions to determine the best parameters to achieve an optimal signal correction and cell selection. Moreover, all the chosen parameters should be recorded for reproducibility purposes. Despite these requirements, many current preprocessing pipelines still rely on switching between platforms that include, for example, MATLAB applications and (at least partially) closed source online platforms (e.g., Cytobank
). This approach necessitates uploading the data to different platforms and carrying out certain steps in a purely manual fashion, which makes it time-consuming and difficult to reproduce. This is particularly limiting in a clinical setting, where reproducibility and large-scale data analysis are required. Thus, we propose a semi-automated R-based preprocessing pipeline for CyTOF data that is: i) fully reproducible; ii) includes quality checks and, iii) has limited need for supervision once the original setup has been made. This pipeline is developed around an updated version of
CATALYST, an R package designed for preprocessing and differential analysis of mass cytometry data
. This new version of
CATALYST is based on Bioconductor’s
SingleCellExperiment class, the standard for high dimensional single cell data analysis. This pipeline can easily be adapted to each CyTOF user’s needs and will accelerate CyTOF data preprocessing while improving the quality of mass cytometry data generated.
Data description
The data used in this pipeline were generated in the context of the Tumor Profiler project, a multi-center observational study investigating the relevance of different innovative technologies, including CyTOF, imaging mass cytometry, single-cell DNA and RNA sequencing, as well as
ex vivo drug testing to improve the diagnostic of advanced cancer patients
.The samples of interest included tumor biopsies and blood samples collected at the University Hospital Zurich in spring 2020. These samples were assessed by mass cytometry in the context of a set of references including commercially available cell lines, PBMCs from healthy donors and PHA activated PBMCs. PBMCs from patients and healthy donors were collected based on a ficoll gradient
, and tumor samples were dissociated as previously described
. Once in single-cell suspension, all samples were stained for 5 min on ice with Cell-ID
TM Cisplatin-194Pt (#201194, Standard BioTools) to identify dead cells and subsequently fixed with PFA 1.6% (#15710, Electron Microscopy Sciences). Samples were stored as dry pellet at
−80°C until CyTOF measurement.The dataset used in this study was obtained from a single CyTOF experiment, also called batch, where nine references, two blood samples and two tumor samples were barcoded with a 20-well barcoding plate
. Reference samples were selected to contain positive and negative populations for each marker included in the study’s antibody panel. This design was chosen to enable for quality control and batch correction across independent experiments based on quantile scaling as described in
11. Pooled cells were stained with a 40-Ab panel designed to perform an in-depth characterization of the samples’ immune compartment. DNA intercalation was performed with a 1h incubation in Cell-ID
TM Intercalator-Ir (#201192B, Fluidigm). Finally, the cell suspension was diluted 1:10 in Maxpar® Cell Acquisition Solution (#201240, Fluidigm) and 10% of EQ Four Element Calibration Beads (#201078, Fluidigm), and acquired on a Helios™ upgraded CyTOF 2 system at a flow rate of 150 events per second.Throughout this workflow, we will make use of a set of metadata for standard preprocessing steps (normalization, debarcoding and compensation), as well as various quality controls previously acquired over seven independent experiments. An overview of the metadata used is given in
Table 1.
Table 1.
Overview of metadata files used throughout this pipeline, including each file’s description, dimensionality (if appropriate), and purpose for preprocessing or quality control.
Description
Purpose
normalization_beads.fcs
Beads identified using
CATALYST during the normalization
step of a previous CyTOF experiment.
Used as reference beads to correct for changes
in signal sensitivity over time.
ref_bead_counts.csv
A table of mean dual counts for the 6 different bead channels
(columns) obtained from 7 previous CyTOF experiments (rows).
Used as a reference to assess the measurement
sensitivity.
debarcoding_scheme.csv
A binary barcoding scheme of 6-
choose-3 = 20 barcodes with
columns corresponding to barcode channel masses (101, 104,
105, 106, 108, 110) and rows corresponding to barcodes (7
empty, 9 references, 2 PBMC and 2 tumor samples)
Used for single-cell deconvolution of multiplexed
samples.
spillover_matrix.csv
A spillover matrix calculated with
CATALYST from beads single-
stained with each of the 40 antibodies included in the panel
used in this study. The matrix contains, for each measurement
channel (rows), the percentage of signal received by all other
channels (columns).
Used for correction of spillover.
ref_cell_counts.csv
A table of the number of cells measured in 7 previous
experiments, each including 4 cell line, 3 PBMC and 2 tumor
references samples (63 samples in total).
Used to assess reference sample cell yields in the
current in comparison to previous experiments
sample_cell_counts.csv
A table of the number of cells measured in 7 previous
experiments, each including 2 PBMC and 2 tumor samples each (28
samples in total).
Used to assess sample cell yields in the current in
comparison to previous experiments
ref_marker_levels.csv
A table of the 98th expression percentiles for each target
(columns) across 7 previous experiments (rows).
Used to assess the staining efficiency of the
current experiment
Data organization
Most data used and returned throughout this workflow are kept in an object of Bioconductor’s
SingleCellExperiment (SCE) class from the
SingleCellExperiment package
. This data structure can store all single-cell related data (measurement data and transformations thereof; cell, feature and experiment-wide metadata; dimensionality reductions), allowing for synchronized and thus less error-prone data manipulation.The key component of SCEs are matrix-like
assays, where rows are features (targets) and columns are observations (cells), that store the measurement data and any data derived thereof. Metadata associated with cells are stored under
colData, feature metadata under
rowData, and any experiment-wide metadata may be stored in the
metadata slot. Lastly, the SCE can store an arbitrary number of dimensionality reductions under
reducedDims. For a more detailed description of usage and structure of SCEs, we refer to the
package’s documentation.
Results
The pipeline presented here describes all steps required to process raw mass cytometry data to a state where the user may proceed with downstream analyses (e.g., dimensionality reduction, differential analysis, trajectory inference). The process includes the concatenation of the individual acquisitions, the exclusion of part of the acquisition with unstable signal, the correction for time-dependent signal drift via bead normalization, the correction for signal spillover via compensation, the selection of cells of interest via automated gating, and the correction for batch effects. The workflow is exemplified on data from a single CyTOF experiment collected via three successive acquisitions (individual FCS files) of 15 barcoded samples mixed with calibration beads.Throughout, raw measurement data (FCS files) as well as all metadata (for debarcoding, normalization, compensation, and quality control) are expected to be located inside a
data/ subdirectory (relative to where the code is being run); otherwise, the presented file paths require modification.We use
CATALYST
to perform key preprocessing steps, including: concatenation, normalization, debarcoding and compensation;
and
for gating;
,
and
for visualization;
, `r CRANpkg(“
reshape2”)´
and
for data accession and manipulation; and
to compute polygonal live gates. Thus, our workflow is limited to the following dependencies:Besides standard preprocessing steps, we include quality control (QC) steps to assess CyTOF sensitivity, staining efficacy, and cell yield; these rely on results from previous experiments (
n = 7) as a reference. For consistent visualization, we define a common plotting theme for boxplots that are used to compare the current to previous experiments:
Constructing a
SingleCellExperiment
By default,
’s
read.FCS() function, which underlies
read.flowSet() for reading in a set of FCS files, transforms channel intensities and removes events with extreme values. To omit this behavior, we recommend reading in files with arguments
transformation = FALSE and
truncate_max_range = FALSE; by default, files will be read in by
CATALYST’s prepData() function with these settings.As described above, the SCE class allows the keeping of multiple data transformations in a single object. Thus, when applying a transformation to arrive at expression-like data, we can store the transformed data in a separate assay without overwriting the raw ion count data. In this way, any data generated and used throughout preprocessing (e.g., normalized, compensated or batch-corrected counts and their arcsinh-transformed counterparts) can be in principal retained, and written to intermediate FCS files for backup or quality control outside of R. However, it is worth noting that this procedure could lead to a shortage of memory for large datasets, in which case overwriting the data at each step is advisable; if not specified otherwise,
CATALYST overwrites by default.A SCE can be constructed using
CATALYST’s prepData() function, which accepts a path to a directory with one or many FCS files, a character vector of FCS filenames, a single or list of
flowFrame(s), or a
flowSet (
package
). By default (
transform = TRUE), an arcsinh-transformation with
cofactor = 5 is applied to the input (count) data, and the resulting expression matrix is stored in the
exprs assay slot of the output SCE:Initially, our SCE has two assays containing dual ion counts (assay
counts) and cofactor-5 arcsinhtransformed counts (assay
exprs). The cofactor used for transformation is stored inside the object’s internal metadata (
int_metadata(sce)$cofactor), and the FCS file of origin for each cell under cell metadata column
sample_id (accessible via
colData(sce)$sample_id or, equivalently,
sce$sample_id). In our dataset, FCS files correspond to acquisitions rather than biological samples. Thus, we rename the cell metadata variable
sample_id to
file_id to avoid ambiguity:The total number of cells across all acquisitions corresponds to the number of columns in the SCE (
ncol(sce): 368152). We can summarize the number of cells in each file by tabulating the
file_ids:In both mass and flow cytometry, each feature has both a channel and target associated with it. As can be seen from printing the
sce variable above,
prepData() defaults to using targets as rownames (when available). We can retrieve each feature’s measurement channel using the
channels() accessor, and use channel metals and masses to extract the indices of features that are relevant to different preprocessing steps. Namely, we assign channels measuring DNA to the variable
dna (here, Ir191 and Ir193), and channels for live gating (here, Ir191 for DNA and Pt194 for cisplatin) to
live:
Filtering for stable signal
High quality data generation requires a stable signal throughout the acquisition. Various issues can lead to signal change over time, including unstable flow rate, introduction of air or introduction of metal contamination in the system. These changes in signal intensity can vary in terms of duration and intensity, and can affect all or only a subsets of channels simultaneously. In order to detect regions of the acquisition affected by signal instability, we display the signal for selected channels as a function of time in a scatter plot (
Figure 1).
Figure 1.
Scatter plots of DNA channel Ir191 and the 41 channels measuring antigens against time.
Bins are colored by cell density; y-axis corresponds to cofactor-5 arcsinh-transformed dual counts.
Scatter plots of DNA channel Ir191 and the 41 channels measuring antigens against time.
Bins are colored by cell density; y-axis corresponds to cofactor-5 arcsinh-transformed dual counts.In this particular experiment, we do not observe time-related signal instability. In case part of the acquisition should be excluded, this could be done by manually gating on the region with stable signal, and subsequent subsetting to only retain cells that fall within the gate’s boundaries (argument
pop = "+"). Vice versa, it is possible to select a region with unstable signal, and remove it from the SCE object (
pop = "-"). For the sake of completeness, we include how a region of unstable signal could be excluded via manual gating:
Normalization
In the case of mass cytometry, signal drift during acquisition due to a progressive loss of sensitivity must be accounted and normalized for. A widely established strategy is to mix samples with polystyrene beads embedded with metal lanthanides, allowing monitoring of instrument performance throughout data acquisition
. These beads are in turn used to estimate and correct for the signal’s time drift. When independent experiments have to be analyzed in the same context, variation due to changes in instrument performance over time combined with intervals between scheduled maintenance have to be taken into account as well. In this case, the bead signal should be normalized to a set of reference beads from an earlier experiment. This ensures that different experiments are normalized to the same level, independent of the CyTOF’s sensitivity.A MATLAB tool to perform normalization outside of R was available until recently at
nolanlab/bead-normalization; current R implementations are available through
and
.
CATALYST provides an extension of bead-based normalization as described by Finck
et al.
, with automated identification of bead singlets (used for normalization), as well as of bead-bead and cell-bead doublets (to be removed), thus eliminating the need for manual gating. This is implemented as follows:beads are initially identified as those events that have their highest signals in the bead channelscell-bead doublets are removed by applying a separation cutoff on the distance between the lowest bead and highest non-bead channel signalevents passing all vertical gates defined by the lower bounds of bead signals are removed (these include bead-bead and bead-cell doublets)bead-bead doublets are removed by applying a default
median ± 5
mad rule to events identified in step 2; the remaining bead events are used for normalizationThe above procedure is carried out by a single function,
normCytof(), which takes as input a SCE and a set of arguments that control the normalization parameters and output format. Here, we specify
dna = 191 (Ir191) and
beads = "dvs", corresponding to DVS Science beads (lanthanides Ce140, Eu151, Eu153, Ho165, Lu175). Secondly, we provide the path to a set of reference beads (argument
norm_to) that are used to compute baseline intensities for normalization. Lastly, we set
overwrite = FALSE to retain both raw and normalized data, and
remove_beads = TRUE to exclude bead and doublet events:When
remove_beads = TRUE (the default),
normCytof() will return a list of three SCEs containing filtered, bead and remove events, respectively, as well as two
ggplot objects:The first SCE (
res$data) contains the filtered data with the additional assay slot
"normed" housing normalized expressions. The remaining two SCEs are data subsets that contain any events identified as beads (slot
beads) and all removed events (including beads, bead-bead and bead-cell doublets; slot
removed), respectively; thus, the
beads themselves are a subset of the
removed events. Here, we compare the number and percentage of cells contained in each subset:As a first quality control plot,
res$scatter (
Figure 2) renders scatter plots of bead channels (x-axis) versus DNA (y-axis), where events identified as beads as well as their expression range are highlighted in color; bead events should have low DNA intensity (since they are not cells) and high intensities across all bead channels.
Figure 2.
Scatter plots for bead channels vs. DNA.
Events identified as beads are colored in blue; for each bead channel, expression ranges across all bead events are indicated as rectangular gates. Events were downsampled to at most 10,000 for visualization.
Scatter plots for bead channels vs. DNA.
Events identified as beads are colored in blue; for each bead channel, expression ranges across all bead events are indicated as rectangular gates. Events were downsampled to at most 10,000 for visualization.Secondly,
res$lines (
Figure 3) displays smoothed median bead intensities before and after normalization; these typically decrease with time prior to normalization, and should be approximately constant and centered around the baseline after normalization. In our dataset, normalization is performed based on previously acquired reference beads. Thus, baseline values correspond to the reference bead’s mean bead channel intensities. As shown in
Figure 3, the bead channel levels are considerably lower after normalization, indicating higher sensitivity in the current experiment. Importantly, the slight decrease in signal over time is no longer present after normalization.
Figure 3.
Running-median smoothed bead intensities vs. time before and after normalization; colored by bead channel.
In order to assess the sensitivity of the CyTOF during acquisition and identify potential issues that would have remained undetected during the tuning of the instrument, we compute the mean bead channel counts across events identified as beads (
res$beads subset). A logical vector of which channels correspond to beads is stored under
rowData column
bead_ch, which we can use to subset the
counts assay to include bead channels only.To assess the measurement sensitivity during the current experiment, we compare the mean bead channel counts computed above to those obtained from 7 previously acquired experiments available in metadata table
ref_bead_counts.csv. The resulting boxplot (
Figure 4) shows that the current experiment’s sensitivity is relatively high, but well in the range of previous experiments.
Figure 4.
Bead channel count quality control.
Boxplot comparing the mean dual ion counts (y-axis) across bead channels (x-axis) obtained for the current experiment (red crosses) to those from 7 previously acquired reference experiments (boxes).
Bead channel count quality control.
Boxplot comparing the mean dual ion counts (y-axis) across bead channels (x-axis) obtained for the current experiment (red crosses) to those from 7 previously acquired reference experiments (boxes).After normalization, we overwrite the input dataset with the filtered subset that no longer includes bead events, or bead-bead and bead-cell doublets:
Debarcoding
In mass cytometry, samples are often labeled with unique sample-specific barcodes and pooled together for processing and measurement, an approach termed
multiplexing
. The most widely used barcoding scheme is based on Zunder
et al.
, and relies on binary palladium-based mass-tag cell barcoding. Here, each sample
i = 1,...,
n is either positive or negative for each of
m palladium isotopes, resulting in an
m-choose-
k barcoding scheme, where
k is the number of positive barcodes. For example, labeling of three out of six palladium isotopes will result in
unique barcodes. In order to recover the individual samples for further analysis, the pooled dataset is debarcoded (or
deconvoluted) computationally.The single cell debarcoding (SCD) algorithm first sorts each cell’s barcode intensities to assign preliminary barcode IDs such that a cell is assigned to the barcode population for which its barcode intensities are highest. Next, intensities within each barcode population are scaled using the 95th expression quantiles, and thereby brought to a comparable scale. Finally, events whose separation between highest negative and lowest positive barcode intensity is below a threshold value (
separation cutoff) are left unassigned.In the initial SCD algorithm, sample yields are determined by a single global cutoff on the separation between positive and negative barcode populations. Naturally, this procedure is suboptimal when yields as a function of the applied cutoff do not decline simultaneously. To optimize cell yields in such cases,
provides an option to automatically estimate or specify
sample-specific separation cutoffs.The SCD algorithm is implemented in
CATALYST as a three-step procedure: i) preliminary barcode assignment (
assignPrelim()); ii) automated estimation of sample-specific separation cutoffs (
estCutoffs()); and, iii) application of cutoffs to arrive at final barcode assignments (
applyCutoffs()).
Preliminary barcode assigment
For our dataset, a 6-choose-3 = 20 barcoding scheme was used (
Figure 5). Five barcodes were unused (
empty_1-5), resulting in 15 samples (9 references, 6 samples of interest). We first read the corresponding
debarcoding_scheme.csv into R:
Figure 5.
6-choose-3 palladium isotope debarcoding scheme.
Rows correspond to palladium isotopes (barcode channels), columns to barcode identifiers (samples). Each sample is negative (white) or positive (grey) for 3 out of 6 barcode channels, resulting in 20 unique barcode combinations.
6-choose-3 palladium isotope debarcoding scheme.
Rows correspond to palladium isotopes (barcode channels), columns to barcode identifiers (samples). Each sample is negative (white) or positive (grey) for 3 out of 6 barcode channels, resulting in 20 unique barcode combinations.During this first debarcoding step, each event is preliminarily assigned to a barcode according to its top-
k expressed barcode channels. Here, events whose expression is highest for a combination of barcode channels that does
not appear in the debarcoding scheme (
bc_key) will be given barcode ID 0 (for “unassigned”). Thus, we can remove empty barcodes from the
bc_key variable such that events assigned to these barcodes are left unassigned from the start. Alternatively, one could perform debarcoding using the non-subsetted key, and filter out empty barcodes downstream.For preliminary barcode assignment, we use
CATLAYST’s assignPrelim() function, providing the input data (
sce) and debarcoding scheme (
bc_key). If not specified otherwise,
assignPrelim() will default to using the
exprs assay slot (argument
assay). Because we ran
normCytof() with
overwrite = FALSE, this assay contains arcsinh-transformed
raw counts; we set
assay = "normexprs" in order to use the normalized values instead:In the returned SCE, feature metadata (
rowData) column
is_bc indicates whether or not a channel corresponds to a barcode channel:For each event, barcode identifiers are stored in
colData column
bc_id. After this preliminary round of assignment, 57980/337525 events (17.18%) have been left unassigned:Furthermore, for each cell, the barcode channel expressions are scaled relative to the 95th expression percentiles of its respective barcode population. The scaled data is stored in assay slot
scaled. Based on these scaled barcode channel intensities, a separation value is computed as the distance between highest negative and lowest positive barcode channel; separations are stored in
colData column
delta.
Estimation of separation cutoffs
To decide on separation cutoffs, we consider yields upon debarcoding as a function of the applied cutoff (
Figure 6). Commonly, this function will be characterized by an initial weak decline, where doublets are excluded, and subsequent rapid decline in yields to zero. In-between, low numbers of counts with intermediate barcode separation give rise to a plateau. Ideally, the applied separation cutoffs should provide a balance between high cell yield and low assignment uncertainty, marking the approximate midpoint of the yield function’s plateau region.
Figure 6.
Yield plot for a 6-choose-3 debarcoding scheme.
Shown is the distribution of barcode population separations (histogram) and cell yields by sample (lines) as a function of the applied separation cutoff.
Left axis corresponds to cell yield in percent; right axis to the total number of cells.
Yield plot for a 6-choose-3 debarcoding scheme.
Shown is the distribution of barcode population separations (histogram) and cell yields by sample (lines) as a function of the applied separation cutoff.Left axis corresponds to cell yield in percent; right axis to the total number of cells.Instead of a single global cutoff, we estimate a sample-specific cutoff to account for barcode population yields that decline in an asynchronous fashion. To this end, we fit both a linear and a three-parameter log-logistic model to each yield function. For the linear fit, we estimate the cutoff as the value for which yields have declined to 50%. For the log-logistic fit, we compute the cutoff as the value for which there is minimal yield decline by minimizing each yield function’s 1st derivative. For each barcode, the final cutoff estimate is computed as the mean of both estimates, weighted with the goodness (residual sum of squares) of each fit (see
Methods for details). Thus, the choice of thresholds for the distance between negative and positive barcode populations is: i) automated and ii) independent for each barcode. Nevertheless, reviewing barcode-specific yield plots and, in rare cases, refining the estimated separation cutoffs is advisable (see
Figure 7).
Figure 7.
Yield plot for an exemplary sample, including the estimated separation cutoff.
Shown is the distribution of barcode population separations (histogram) and cell yields (line) for the sample as a function of the applied sample-specific separation cutoff. Left axis corresponds to cell yield in percent; right axis to the total number of cells.
Cutoff estimation is performed by
CATALYST’s
estCutoffs() function, which takes as input a SCE as returned by
assignPrelim(); that is, preliminary barcode assignments are required to estimate separation cutoffs.
estCutoffs() will store sample-specific cutoff estimates under
metadata slot
sep_cutoffs, but will leave barcode assignments unchanged.We can visually inspect the estimated cutoffs using
plotYields() with argument
which specifying the barcode ID of interest (
Figure 7). In our example, the cutoff estimate nicely marks the midpoint of the yield function’s plateau or, equivalently, the valley between peaks of cell yields. To decrease the stringency of the applied cutoff, and thus increase the resulting cell yield, we could set the sample’s cutoff to e.g. 0.1. Vice versa, a more stringent cutoff of e.g. 0.2 would decrease the cell yield but yield a purer population.As an alternative to inspecting the cutoff estimate for each sample in R, we could specify
which = bc_ids to obtain a list of yield plots for all barcodes; the generated set of plots may be written to a single PDF file via providing
plotYields() with an
out_path to allow for easy reviewing of the separating cutoffs currently stored within the object.
Yield plot for an exemplary sample, including the estimated separation cutoff.
Shown is the distribution of barcode population separations (histogram) and cell yields (line) for the sample as a function of the applied sample-specific separation cutoff. Left axis corresponds to cell yield in percent; right axis to the total number of cells.Besides a cutoff on the separation between positive and negative barcode populations, to trim outliers, the SCD algorithms applies an additional cutoff on the
Mahalanobis distance (argument
mhl_cutoff), a metric that quantifies the distance of a given event to the expression distribution of the barcode population it has been assigned to.In
Figure 6, we can observe that population yields decline synchronously with increasing separation cutoffs, and that we might consider applying a global separation cutoff, e.g., at
∼ 0.15. For this data, yields are in fact similar, independent of whether we apply sample-specific cutoffs or a single global one. Nevertheless, applying sample-specific cutoffs is recommended in order to maximize cell yields while minimizing uncertainty in barcode assignments.After debarcoding, we compare the number of events assigned to each barcode population before and after applying separation cutoffs, and filter out events that have been left unassigned (barcode ID 0). As shown in
Figure 8, after applying the separation cutoffs, the number of unassigned cells (0) increases, while the number of cells assigned to each barcoding well decreases. We also observe a higher decrease in assigned cells for tumor samples, which underwent a dissociation process and contain more debris. Conversely, highly viable cell lines and PBMCs have a higher recovery yield.
Figure 8.
Barplot of cell counts before (black) and after (grey) applying separation cutoffs.
Compensation
Mass cytometry utilizes heavy metals (usually from the lanthanide series) as reporters to label antibodies. As a result, channel crosstalk originating from spectral overlap and autofluorescence is significantly less pronounced in mass cytometry compared to flow cytometry. Yet, spillover due to abundance sensitivity, isotopic impurities, and oxide formation still exists, giving rise to artefactual signal that can impede data interpretability.A combined experimental-computational pipeline to correct for spillover in mass cytometry data has been proposed by Chevrier
et al.
and is implemented in the
package. In brief, compensation is achieved via the following three-step approach outlined here (see for details).Identification of single positive populations via deconvolution of single-stained beads (
assignPrelim(),
estCutoffs(),
applyCutoffs()).Estimation of a spillover matrix (SM) from the populations identified (
computeSpillmat()).Compensation via multiplication of measurement intensities by the SM’s inverse, the compensation matrix (
compCytof()).We will apply a pre-acquired spillover matrix (metadata file
spillover_matrix.csv). Thus, we enter at step 3, which involves only compensating the input dataset using
CATALYST’s
compCytof() function. By default,
compCytof() will reuse the cofactor stored in
int_metadata(sce)$cofactor for computing arcsinh-transformed data from the compensated counts, thus applying the same transformation as during data preparation and normalization:To visually inspect how compensation affects signal intensities, we can generate scatter plots pre- and post-compensation; an exemplary pair of channels is shown in
Figure 9. In such a plot, we can observe a slight positive association between the signals of spill-affected channels, which should be removed upon compensation.
Figure 9.
Scatter plots for two exemplary channels before (left) and after correction for spillover (right).
Gating
Many events acquired in mass cytometry may in fact be debris, doublets or dead cells, and should be filtered out through a gating step. Here, we suggest a strategy that first applies an elliptical gate on cell events, defined as double positive for the DNA channels Ir191/Ir193. This allows the exclusion of debris and doublets. As a second step, we discard cells that are positive for the dead cell marker Pt194.These two steps are performed using the
R package
, and the resulting gates are visualized on scatter plots of the channels subjected to gating using
. For consistent visualization, we again define a common plotting theme for scatter plots of channels
chs that include the gating boundaries for the specified
gate_id:
Gating on cells
In order to apply sample-specific gates, we first convert the SCE into a
flowSet with a separate frame for each sample (argument
split_by = "bc_id"). As gating should be performed on expression-like data (not ion counts), we further specify
assay = "exprs" to retain the arcsinh-transformed assay slot. Thirdly, since conversion from SCE to
flowCore data structures requires matrix transposition (rows correspond to targets in the SCE, but to events in
flowFrame/Sets), we retain only those channels that are relevant when gating of (live) cells: DNA and dead channels, whose indices are stored in variables
dna and
live.We apply an elliptical gate (
gating_method = "flowClust.2d") to exclude the two lowest density percentiles (
quantile = 0.98). Because the input gating set contains a separate frame for each barcode, the gate will be computed separately for each sample. In case of a single DNA channel (e.g., Rh103), one-dimensional gates (i.e., thresholds on minimum and maximum values) would be applicable instead.We use
to produce scatter plots of the DNA channels, with
geom_gate("cells") to visualize the gates computed above (
Figure 10):
Figure 10.
Scatter plots of DNA channels, split by sample and including elliptical cell gates.
Gating on live cells
The wrapper function
.live_gate() defines a polygonal gate comprised of a line and a bivariate standard normal density Z, such that cells pass gating when i) their expression is within the
qth quantile of Z; and, ii) their expression falls below a line parameterized by intercept
i and slope
s. In this way, the gate is centered around the expression peak, while excluding cells whose
dead channel intensities increases with DNA content.In contrast to the cell gates above, we apply live gates with sample-specific gating parameters. To this end, we specify a list
l containing quantiles
q, intercepts
i and slopes
s for each sample. These parameters are updated iteratively to remove dead cells while retaining cell yields as high as possible (
Figure 11). After manual adjustments, we arrive at the following sample-specific gating parameters:
Figure 11.
Scatter plots of DNA and dead cell channels, split by sample and including the live cell polygon gates.
We display the yield of
"cell" and
"live" gates on each samples to quickly assess the cell losses occurring at the two gating steps (
Figure 12). As expected the
"cell" gate leads to a systematic loss of around 1% of cells across all the samples. The
"live" gate leads to a stronger reduction of cell yield in the tumor samples, consistent with the fact that those samples, which underwent enzymatic dissociation, contain more dead cells.
Figure 12.
Barplot of cell and live gating yields.
For each barcode ID (x-axis), frequencies are relative to the total number of cells in the population before gating; bars are colored by gate ID.
Barplot of cell and live gating yields.
For each barcode ID (x-axis), frequencies are relative to the total number of cells in the population before gating; bars are colored by gate ID.We extract a logical vector indicating whether a given event is included in or excluded by the
"live" gate applied above by applying
gh_pop_get_indices to each sample in
gs. Secondly, we extract the cell indices from
gs and subset the SCE to keep only cells that passed the
"live" gate.Finally, we can again visualize scatter plots of dead channels against DNA as a quality control for the retained subset of cells (
Figure 13).
Figure 13.
Scatter plots of dead cell channel against DNA, including the subset of cells remaining after live cell gating.
Quality control
Having completed the standard preprocessing steps, we proceed to investigate how the current experiment compares to prior experiments in terms of the number of cells in each reference and sample, and the expression levels of each target. Large parts of the metadata generated by now may no longer be needed, and unnecessarily increases output file sizes for large-scale datasets. Therefore, we will retain only two key cell metadata variables:
sample_id containing the FCS filename each cell originates from, and
bc_id containing the barcode population assignments. We secondly rename these variable to make the following quality control steps more intuitive.In the debarcoding scheme used for deconvolution of the multiplexed samples (Section
Debarcoding), barcode identifiers were chosen to contain all information relevant for each sample. This setup allows us to extract sample metadata directly from the
bc_ids. Alternatively, and especially for more complex experimental designs, this information could be stored in a separate metadata table. Such a table could then be used to match the
bc_ids with the listed samples, and add arbitrary metadata information (e.g., batch, patient ID, treatment).In our example, barcode identifiers include each sample’s type (
CellLine,
PBMC or
Tumor), group (
R for reference or
S for sample of interest), and replicate number; and follow a consistent naming scheme: “_”. We can easily extract these components and store them in the SCE’s cell metadata (
colData):
Quality control (QC) on reference cell counts
As a first quality control, we compare the cell counts of each reference sample (
R) to those obtained from 7 previous experiments (
Figure 14). Since the references are obtained from pre-barcoded aliquots of cells, the number of reference cells acquired gives direct information regarding the cell yield throughout the whole experiment: From cell barcoding to acquisition on the CyTOF. As shown in
Figure 14, the current experiment tends to have a lower yield compared to average experiments.
Figure 14.
Reference cell count quality control.
Boxplot comparing the reference cell counts obtained for the current experiment (red crosses) to those from 7 previously acquired experiments.
Reference cell count quality control.
Boxplot comparing the reference cell counts obtained for the current experiment (red crosses) to those from 7 previously acquired experiments.
QC on sample cell counts
Secondly, we compare the cell counts for the 4 samples of interest (2 PBMC, 2 tumor samples) to the number of cells recorded for 14 tumor and PBMC samples each (28 samples in total) acquired in previous experiments (
Figure 15). This step provides a first quality assessment of the samples of interest. Here, samples with too few cells will be less reliable, and potentially less representative of the original tissue, making conclusions from downstream analyses more difficult to draw.
Figure 15.
Sample cell count quality control.
Boxplot comparing the sample cell counts obtained for the current experiment (red crosses) to those from 7 previously acquired experiments.
Sample cell count quality control.
Boxplot comparing the sample cell counts obtained for the current experiment (red crosses) to those from 7 previously acquired experiments.
QC on mean marker intensities
As the third and final quality control, we compare the 98th expression quantiles across all targets of interest over the pooled references to those obtained from 7 previously acquired experiments available in metadata table
ref_marker_levels.csv (
Figure 16). We chose the 98th percentile to account for the fact that some populations are rare, and we are particularly interested in assessing signal stability for positive cells rather than the median of the population. Since the pooled references are identical from one experiment to another, this gives a direct indication of the current experiment’s staining efficacy and enables early identification of antibody degradation over time.
Figure 16.
Mean marker expression quality control.
Boxplot comparing the mean marker expression obtained for the current experiment (shown in red) to those from 7 previously acquired experiments.
Mean marker expression quality control.
Boxplot comparing the mean marker expression obtained for the current experiment (shown in red) to those from 7 previously acquired experiments.
Batch alignment
Each CyTOF experiment contains the same set of references. Similar to the approach used by Schuyler
et al.
, we use these references as anchors to calculate a channel-specific correction factor by dividing the 98th percentile measured in the current experiment by the average 98th percentile obtained across the first seven experiments of the project. The signal observed in each channel for the samples of interest is then divided by these correction factors derived from the reference samples.To visually assess the effect of the batch correction applied above, we compare the expression distributions before and after scaling (
Figure 17). We additionally include 98th expression percentiles of both the (un)corrected samples as well as of the references used for computing correction factors. Percentiles are aligned with the references’ upon correction while, even for the most affected channels (largest deviation from the references and, consequently, highest batch correction factors), distributions are very similar before and after scaling.
Figure 17.
Batch alignment quality control.
Expression distributions before (blue) and after (red) quantile scaling using 7 previously acquired experiments as reference. Included are the 6 most affected channels (i.e., highest absolute correction factors). Dashed lines indicate 98th expression percentiles averaged across references; points represent the respective distributions’ 98th percentiles.
Batch alignment quality control.
Expression distributions before (blue) and after (red) quantile scaling using 7 previously acquired experiments as reference. Included are the 6 most affected channels (i.e., highest absolute correction factors). Dashed lines indicate 98th expression percentiles averaged across references; points represent the respective distributions’ 98th percentiles.
Discussion
In this workflow, we have presented a pipeline for reproducible and highly-automated preprocessing of CyTOF data, based on an updated version of
CATALYST. Our pipeline covers four standard steps: Normalization for signal time-drift using bead standards (Section
Normalization), single-cell deconvolution of multiplexed samples (Section
Debarcoding), correction for spillover via compensation (Section
Compensation), and gating for live cells (Section
Gating). Moreover, we have included various quality control steps that compare the current experiment to a set of reference data (Section
Quality control). These steps ensure that measurement sensitivity, gating cell yields, sample cell counts, and expression levels lie within the expected range.A key advantage of both using and developing Bioconductor packages is that they utilize common data structures, thereby greatly facilitating interaction between them. For example, many of the data structures used in scRNA-seq data analysis have only become established relatively recently. Meanwhile, the cytometry community has been relying on the FCS file format for data storage, and
’s
flowFrame/flowSet as well as
’s
GatingSet classes for computational analyses. While there exists a lot of infrastructure around these data structures, they impede method development for newly emerging standards, and act as a barrier for interpolation of analyses across currently developed packages. This is particularly visible in the context of other fast growing single-cell data types such as scRNA-seq data analysis, where most current methods are being developed around Bioconductor’s
SingleCellExperiment class. To name just two examples, an extensive collection of visualization tools for SCEs is available through
, including a variety of dimensionality reduction methods; and methods for differential abundance (DA) analysis (to detect subpopulations that are differently abundant between conditions) and differential state (DS) analysis (to test for subpopulation-specific expression changes across conditions) are implemented in
.The SCE class allows storing multiple
assays that can, for example, contain raw counts, expression-like data obtained upon arcsinh-transformation, as well as any intermediate data matrices obtained after normalization, compensation and batch correction. Moreover, any event (cell) and feature (marker) metadata generated in the process can be added to the object’s
colData/rowData, alongside an arbitrary number of dimensionality reductions. Thus, SCEs present an overall more compact and less error-prone data structure for both preprocessing and downstream analysis when compared to storing the various data matrices or metadata in separate variables, which would have to be combined for certain computations, separately subsetted and saved to independent outputs.There is an obvious benefit for the mass cytometry community to take advantage of these new infrastructure developments. However, it is equally important to maintain backward compatibility with well-established standards in the field. For example, it can be desirable to write out intermediate outputs (FCS files) after each proprocessing step, or make use of available tools that build around
flowCore’s
flowFrame and
flowSet classes, or other classes derived thereof (e.g.,
’s
GatingSet). Thus, while
CATALYST’s transition to a more recent and an arguably advantageous data structure is motivated by the ability to leverage many existing and newly-developed tools, a complete dismissal of the large infrastructure that is available in the realm of cytometry data analysis is impossible at this time. To facilitate conversion between SCEs and conventional cytometry data structures,
CATALYST provides the
sce2fcs() function, which allows the user to specify which assay data to retain, whether to drop or keep available cell metadata and dimensionality reductions, and (optionally) to split the input dataset by a non-numeric variable (to, e.g., export each sample to a separate FCS file).Although the current version of this pipeline constitutes a comprehensive approach to generate high-quality data for downstream analysis, further developments could be added in the future. In particular, it could be useful to implement an automated way of identifying and removing part of the data with unstable signal, similar to the approach proposed by
, an R package designed to exclude fluorescent anomalies in flow cytometry data. Given that selection of anomalies in the dataset by the user is subjective, or that they may be altogether undetectable by eye, the advantage of such an approach would be to further standardize the process while decreasing manual work.Recently, batch normalization has become of increased importance in order to enable integration of datasets acquired at different times, by different users and on different instruments. Here, we use scaling normalization where references are used as anchors to correct all samples included in the analysis in a channel specific way, similar to the strategy proposed by Schuyler
et al.
. While this approach requires a well-defined experimental procedure where references with positive and negative subsets for each marker have to be included in each experiment, it does not make any assumptions on sample compositions. Thus, since the dataset used in this pipeline was acquired on the same instrument and stained with the same frozen antibody panel as previous experiments, scaling by expression quantiles provides an efficient way to correct for batch effects.To increase the flexibility of batch correction in cases where the experimental variation is higher,
CATALYST could integrate different methods that have the potential to increase batch correction efficiency. For example,
computes quantiles for every metacluster and for every marker after aggregation of control samples from each batch. Such an approach could be more appropriate in cases where the references’ expression distributions are less aligned. An alternative method,
, exploits the concept of pseudo-replicates to remove unwanted variation (RUV) between proteins and cells. Here, cells are grouped into subpopulations using
clustering. Groups of cells present across all batches are considered to be pseudo-replicates, and their deviation (residuals) from the average signal across batches is used to estimate and correct for the batch effects.Although various methods to correct for batch effects in both the presence and absence of references have been proposed, a systematic comparison of batch correction tools for mass cytometry data is missing. Thus, whether the approach used in this study to align batches on the basis of shared references is the most accurate remains unresolved.Our pipeline is entirely R-based and does not rely on switching between platforms. Thus, it omits the need for heavy data transfers between online cloud services, graphical user interfaces (GUI), and programming environments for different parts of preprocessing and downstream analysis. As a result, each step in the analysis is fully reproducible and any parameters used throughout can be easily modified and documented. This transition from manual, GUI-based to largely automated, programmatic data processing is indispensable for clinical and other large-scale studies, where sample throughput is high and reproducibility ever so important.Since its first submission to Bioconductor in 2017,
CATALYST has undergone continuous maintenance and development. The most noteworthy changes include implementation of a comprehensive visualization suite based on Nowicka
et al.
’s workflow for differential discovery; and, the transition from custom data structures to using Bioconductor’s
SingleCellExperiment class for differential analysis with Bioconductor v3.11, and for preprocessing with v3.12. Taken together, these developments have transformed
CATALYST into a one-stop tool for cytometry data analysis, enabling both data preprocessing and in-depth downstream analysis.
Methods
Commonly, bead events are identified by manual gating on scatter plots of DNA vs. bead channels where DNA should be low, and expression should be high across all bead channels. Instead, we propose a programmatic way to identify beads that includes detection of bead-bead and cell-bead doublets.Our normalization strategy leverages the already established SCD algorithm for preliminary tagging of events as beads. In this context, the debarcoding scheme is a 2×(2+
m) matrix (
Figure 18). Here, columns correspond to the two DNA channels and
m barcode channels; rows correspond to barcodes 0 (no bead) and 1 (is bead), where non-bead events are positive for DNA channels only (barcode 11000. . . ), while bead events are negative for DNA and positive for all bead channels (barcode 00111. . . ):
Figure 18.
Schematic of the debarcoding scheme used by ‘CATALYST‘’s ‘normCytof()‘ function to identify bead events.
Rows correspond to barcodes, columns to DNA and bead channels. Each barcode is either positive (grey) or negative (blank) for a given channel; cells (barcode 11000...) are positive for DNA and negative for bead channels, bead events (barcode 00111...) are negative for DNA and positive for bead channels.
Schematic of the debarcoding scheme used by ‘CATALYST‘’s ‘normCytof()‘ function to identify bead events.
Rows correspond to barcodes, columns to DNA and bead channels. Each barcode is either positive (grey) or negative (blank) for a given channel; cells (barcode 11000...) are positive for DNA and negative for bead channels, bead events (barcode 00111...) are negative for DNA and positive for bead channels.Upon initial assignment of bead events, we apply a
median ±
n median absolute deviation (MAD) rule to remove low- and high-signal events from the bead population used for estimating normalization factors. As
n decreases, bead populations become more narrow and bead-bead doublets are excluded. The extent to which bead populations are trimmed can be adjusted via argument
trim (default 5).Notably, slight
over-trimming does not affect normalization. It is therefore recommended to choose a
trim value that is small enough to assure removal of doublets at the cost of reduced bead population sizes.To correct for the effect of event acquisition time on signal intensity, we follow the method proposed by Finck
et al.
. In essence, bead intensities are smoothed using a median sliding-window with width
k (default 500 bead events). At each timepoint, the slope of a line with intercept zero is computed by minimizing the squared error between smoothed bead and mean bead intensities (= baseline). Alternatively, a reference set of beads from which to compute the baseline can be provided. Slopes for non-bead timepoints are obtained via constant interpolation of these slopes. Here, large slopes correspond to significant deviation from the baseline, and small slopes indicate that the signal is already similar to the baseline. Thus, raw bead counts are normalized by multiplication with the fitted slopes at each timepoint.The debarcoding process commences by assigning each event a preliminary barcode ID. This requires specification of a binary barcoding scheme (or debarcoding key)where
i = 1, ...,
n is the barcode index,
j = 1, ...,
m a barcode channel, and
n,
m denote the number of unique barcodes and barcoding channels, respectively. Further, let
k
denote the number of positive barcoding channels for barcode
l:If
k
=
k ∀
i = 1, ...,
n (i.e., every barcode has the same number of positive barcoding channels), the
k channels with the highest signal in a given event are considered to be positive, the remaining
m − k to be negative. The
separation δ of positive and negative events is then defined as the difference between the
kth highest and (
m −
k)th lowest scaled intensity for that event.When the separation between positive and negative barcoding channels is low, we cannot be confident in the barcode assignment.For the estimation of cutoff parameters, we consider yields upon debarcoding as a function of the applied cutoffs. Commonly, this function will be characterized by an initial weak decline, where doublets are excluded, and subsequent rapid decline in yields to zero. In between, low numbers of counts with intermediate barcode separation give rise to a plateau. To facilitate robust estimation, we fit a linear and a three-parameter log-logistic function
to the yields function with
’s
LL.R function
(
Figure 19). As an adequate cutoff estimate, we target a point that marks the end of the plateau regime and on-set of yield decline to appropriately balance confidence in barcode assignment and cell yield.
Figure 19.
Schematic description of automated separation cutoff estimation.
Bar graphs represent the distribution of cells relative to the barcode distance, dotted line scorresponds to yield upon debarcoding as a function of the applied separation cutoff. The yield curve is fitted with a linear regression (blue) and a three parameter log-logistic function (red). The final cutoff estimate (black dashed line) is defined as the mean of estimates derived from both fits, weighted with the goodness of the respective fit.
Schematic description of automated separation cutoff estimation.
Bar graphs represent the distribution of cells relative to the barcode distance, dotted line scorresponds to yield upon debarcoding as a function of the applied separation cutoff. The yield curve is fitted with a linear regression (blue) and a three parameter log-logistic function (red). The final cutoff estimate (black dashed line) is defined as the mean of estimates derived from both fits, weighted with the goodness of the respective fit.We define the linear model cutoff estimate
c
LM as the value for which the cell yield
Y has declined to half of the initial Yield
β
0:where
β
0,
β
1 are the intercept and slope of the linear model fit, respectively.We define the log-logistic model cutoff estimate
c
LLM as the value for which the log-logistic function’s decline is minimized relative to its value:The final cutoff estimate
c is defined as the weighted mean between these estimates:where
w is the goodness of the linear fit relative to the log-logistic fit:As in conventional flow cytometry, we model spillover linearly, with the channel stained for as predictor, and spill-effected channels as response. Thus, the intensity observed in a given channel
j are a linear combination of its real signal and contributions of other channels that spill into it. Let
I denote the (unknown) real and
J the observed signal. Further, let
s
be the proportion of channel
j signal that is due to channel
i, and
w
the set of channels that spill into channel
j. ThenIn matrix notation, measurement intensities may be viewed as the convolution of real intensities and a spillover matrix
where
n denotes the number of samples (cells) and
p the number of features (channels):
J =
I ·
SM. The real signal
I can then be retrieved via:where
SM
−1 is termed compensation matrix (
CM).While mathematically exact, the solution to this equation will yield negative values, and does not account for the fact that ion counts are strictly non-negative. A computationally efficient way to adress this is to instead use non-negative linear least squares (NNLS), which optimizes the least squares criterion under the constraint of non-negativity:To solve for
I, we apply the Lawson-Hanson algorithm
for NNLS implemented in the
nnls package.Because any signal not in a single stain experiment’s primary channel
j results from channel crosstalk, each spill entry
s
can be approximated by the slope of a linear regression with channel
j signal as the response, and channel
i signals as the predictors, where
i ∈
w
.
computeSpillmat() offers two alternative ways for spillover estimation (
20).
Figure 20.
Population versus single-cell based spillover estimation.
In a population-based setting (left), spillover is estimated as the slope of a line through the centers of positive (red) and negative (blue) populations. In a single-cell based setting (right), slopes are computed independently for each cell in the positive population, and spillover is estimated as their median.
Population versus single-cell based spillover estimation.
In a population-based setting (left), spillover is estimated as the slope of a line through the centers of positive (red) and negative (blue) populations. In a single-cell based setting (right), slopes are computed independently for each cell in the positive population, and spillover is estimated as their median.The
default method approximates this slope with the following single-cell derived estimate: Let
i
+ denote the set of cells that are positive in channel
i, and
be the channel
i to
j spill computed for a cell
c that has been assigned to this population. We approximate
as the ratio between the signal in unstained spillover receiving and stained spillover emitting channel,
I
and
I
, respectively. The expected background in these channels,
and
, is computed as the median signal of events that are i) negative in the channels for which spill is estimated (
i and
j); ii) not assigned to potentionally interacting channels; and, iii) not unassigned, and subtracted from all measurements:Each entry
s
in SM is then computed as the median spillover across all cells
c ∈
i
+:In a population-based fashion, as done in conventional flow cytometry,
s
is computed as the slope of a line through the medians (or trimmed means) of stained and unstained populations,
and
, respectively. Background signal is computed as above and subtracted, according to:On the basis of their additive nature, spill values are estimated independently for every pair of interacting channels. Hereby, we take into account only interactions that are sensible from a chemical and physical point of view:
M ± 1 channels (abundance sensitivity),
M + 16 channels (oxide formation), and channels measuring isotopes (impurities;
Figure 21).
Figure 21.
Heatmap of channel interactions expected to exhibit spillover.
Included are only interactions that are sensible from a chemical and physical point of view: adjacent mass channels (abundance sensitivity), +16 mass channels (oxidation), and channels measuring isotopes (impurities).
Heatmap of channel interactions expected to exhibit spillover.
Included are only interactions that are sensible from a chemical and physical point of view: adjacent mass channels (abundance sensitivity), +16 mass channels (oxidation), and channels measuring isotopes (impurities).Alternatively,
interactions = "all" will compute a spill estimate for all
n · (
n − 1) possible interactions, where
n denotes the number of measurement parameters. Estimates falling below the threshold specified by
th will be set to zero. Lastly, note that diagonal entries
s
= 1 for all
i ∈ 1, ...,
n, so that spill is relative to the total signal measured in a given channel.
Data availability
Underlying data
The CyTOF data as well as all metadata required to run the full pipeline presented herein are available from Figshare as well as the Tumor Profiler website at
https://tu-pro.ch/download/catalyst.Figshare: An R-based reproducible and user-friendly preprocessing pipeline for CyTOF data.
https://doi.org/10.6084/m9.figshare.c.5063984.v1This project contains the following underlying data:CyTOF_acquisition_1-3.fcs (40-Ab panel CyTOF data of 2 blood and 2 tumor samples, and 9 reference samples selected to contain positive and negative populations for each marker included in the study’s Ab-panel. Samples were multiplexed with a 20-well barcoding plate, and obtained from a single experiment provided as 3 FCS files.)normalization_beads.fsc (Beads identified using ‘CATALYST‘ during the normalization step of a previous CyTOF experiment. – Used as reference beads to correct for changes in signal sensitivity over time across multiple CyTOF experiments.)ref_bead_counts.csv (A table of mean dual counts for the six different bead channels (columns) obtained from 7 previous experiments (rows). – Used as a reference to assess the measurement sensitivity for the current experiment.)debarcoding_scheme.csv (A binary barcoding scheme of 6-choose-3 = 20 barcodes with columns corresponding to barcode channel masses (101, 104, 105, 106, 108, 110) and rows corresponding to barcodes (7 empty, 9 references, 2 PBMC and 2 tumor samples) – Used for single-cell deconvolution of multiplexed of samples.)spillover_matrix.csv (A spillover matrix calculated with ‘CATALYST‘ from beads single-stained with each of the 40 antibodies included in the panel used in this study. The matrix contains, for each measurement channel (rows), the percentage of signal received by all other channels (columns). – Used for correction of spillover.)ref_cell_counts.csv (A table of the number of cells measured in 7 previous experiments, each including 4 cell line, 3 PBMC and 2 tumor references samples (63 samples in total). – Used to assess reference sample cell yields in the current in comparison to previous experiments.)sample_cell_counts.csv (A table of the number of cells measured in 7 previous experiments, including 2 PBMC and 2 tumor samples (28 samples in total). – Used to assess sample cell yields in the current in comparison to previous experiments.)ref_marker_levels.csv (A table of the 98th expression percentiles for each target (columns) across 7 previous experiments (rows). – Used to assess the staining efficiency of the current experiment.)Data are available under the terms of the [Creative Commons Attribution 4.0 International license](
http://creativecommons.org/licenses/by/4.0} (CC-BY 4.0).
Software availability
Analyses were run in R v4.2.0
, with Bioconductor v3.15
, and all software packages used throughout this workflow are publicly available through the Comprehensive R Archive Network (
https://cran.r-project.org) or the Bioconductor project (
http://bioconductor.org). Specific package versions are captured in the following session information:
Consent
Written informed consent for publication of the tumor and blood samples was obtained from the patients (BASEC-Nr.2018-02050, approved by the Kantonal Ethics Commisions of Zurich and Basel).Thanks for the updated version and for assessing the suggestions and issues mentioned in the first report.I think this revised version is easier to follow and clearer with the updates provided.One last comment would be on the batch alignment section, the Figure 17 selected is not showing the 7 different markers distribution to compare the expression distributions before and after batch correction and I think it would be useful in the assessment of this correction.Is the rationale for developing the new method (or application) clearly explained?YesIs the description of the method technically sound?YesAre the conclusions about the method and its performance adequately supported by the findings presented in the article?YesIf any results are presented, are all the source data underlying the results available to ensure full reproducibility?YesAre sufficient details provided to allow replication of the method development and its use by others?PartlyReviewer Expertise:I work in Bioinformatics and especially in the normalization and batch correction of CyTOF datasets.I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.In their manuscript, Crowell & Chevrier et al. present a novel workflow to preprocess mass cytometry (CyTOF) datasets in R. The presented pipeline is a useful update on earlier publications and packages, including the authors' CATALYST package which is clearly stated. Overall, I find this to be a valuable tool that brings together different functionalities into a unified workflow that enables reproducible and comprehensive preprocessing of this data type. The different steps and approaches are well described and illustrated. Especially, the inclusion of a functionality to perform live cell gating without having to switch platforms is much appreciated, although its current implementation could be improved:
In addition, testing this pipeline on some in-house generated data, a few minor issues occurred which should be addressed:Gating on cells: Cells are first identified using an elliptical gate to exclude the two lowest density percentiles. Firstly, this plot relies on two DNA channels (whose information is likely redundant) and wasn’t directly applicable to alternative DNA stains (e.g. rhodium). Furthermore, I am wondering whether this approach might exclude for example a fraction of cycling cells or preferentially exclude cell types or states with increased chromatin accessibility and therefore higher DNA signal?Gating on live cells: The approach suggested by the authors worked well on my test data, however, it takes a while to manually adjust values for every file to fit the gates closely to the data. While I see the value of automating this step, I also think that some manual gating could simplify the process and further increase downstream data quality. Potentially, the authors could adopt an approach like the gate_draw function from the CytoRSuite library.Compensation: The workflow includes compensation as a preprocessing step which the authors have shown in separate publications to improve data quality, but which is currently not routinely performed by many researchers working using mass cytometry. I, therefore, assume that most users of this pipeline would be relying on published spillover matrices that reflect estimates of isotope purity and oxidation. While I agree with the usefulness of this function, I believe that adding additional quality control functions could improve acceptance of and trust in this approach. For example, in flow cytometry, overcompensation is often easily spotted by the occurrence of overly negative values, however, using their NNLS approach this is not readily apparent in compensated mass cytometry data. It would be very helpful to have a quality metric that would alert users to such potential issues introduced by the compensation step.While this might only be needed in rare cases, a function to rename channels and potentially match these names across multiple fcs files could enhance the adaptability of this package. In my test case, conflicting channel names prevented the import of the files into the workflow. In other cases, it might help to match channel names between batches. The authors could look to the premessa package for inspiration.The authors have incorporated various options for DNA channels which is much appreciated. My test data had been stained with a rhodium intercalator. Specifying this worked well, only the res$scatter function seems to ignore this choice and instead seems to default to iridium DNA intercalators.Sample specific debarcoding is appreciated. Figure 6 and the plotYields function return a debarcoding percentage. I believe this percentage refers to percent of initial assignments, but it is not specifically stated. It might be helpful to get a feeling of the percentage of cells (out of total cells) that are assigned after refining the initial assignments.Is the rationale for developing the new method (or application) clearly explained?YesIs the description of the method technically sound?YesAre the conclusions about the method and its performance adequately supported by the findings presented in the article?YesIf any results are presented, are all the source data underlying the results available to ensure full reproducibility?YesAre sufficient details provided to allow replication of the method development and its use by others?YesReviewer Expertise:Single cell proteomics, Mass Cytometry, Immunology, Stem Cell biologyWe confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.Minor comments:Gating on cells: Cells are first identified using an elliptical gate to exclude the two lowest density percentiles. Firstly, this plot relies on two DNA channels (whose information is likely redundant) and wasn’t directly applicable to alternative DNA stains (e.g. rhodium). Furthermore, I am wondering whether this approach might exclude for example a fraction of cycling cells or preferentially exclude cell types or states with increased chromatin accessibility and therefore higher DNA signal?The first gating step is indeed performed on two DNA channels which contains redundant information. However, this approach is commonly used in the mass cytometry field to exclude debris and cell doublets. By modifying the quantile and the target value defining the center of the ellipse, the user can control how many cells are excluded from the gate and ensure that most cycling cells are kept in the analysis.To gate on alternative DNA stains, a different pair of channels could be assigned to the “dna” variable in the corresponding code chunk. In the case of a single DNA channel, a one-dimensional gating (i.e., thresholding) could be applied (as opposed to the currently used elliptical gate). We have added a comment mentioning this to the text under “Gating on cells”.Gating on live cells: The approach suggested by the authors worked well on my test data, however, it takes a while to manually adjust values for every file to fit the gates closely to the data. While I see the value of automating this step, I also think that some manual gating could simplify the process and further increase downstream data quality. Potentially, the authors could adopt an approach like the gate_draw function from the CytoRSuite library.Indeed, the approach depicted in this paper works well in cases where a limited number of samples are included in a run and when the live/dead cell profile is well defined and consistent between samples. The process can indeed be tedious when hundreds of samples are included in a run or when the live/dead cell profile is more complex. In the latter case, including a function similar to gate_draw function from CytoRSuite could be helpful. However, we here aimed at proposing an automated pipeline; manual gating would defeat this purpose.As a side note: We have attempted applying CytoRSuite, however, encountered several confusing issues that we’ve been unable to resolve: The CytoRSuite site (
https://dillonhammill.github.io/CytoRSuite) lists a GH repository that no-longer exists; we could find an installable version at
https://github.com/gfinak/cytoRSuite (is this the same?) but ‘drawGate()’ gave an error that we have not been able to resolve; meanwhile, any of CytoRSuite, cytoRSuite and cytoSuite (from which the latter has been forked) have not been changed in 4 years. Taken together, this gave us the feeling that the tool is no longer maintained and likely to be inapplicable with current versions of R and Bioconductor.Of course, there may be other tools available at this point for manual gating, and we leave it to the user to incorporate these into their workflow should that be of interest. A possible strategy then might be to i) perform manual gating and export the resulting gates into a table (gating scheme); ii) apply that scheme in an automated fashion (e.g., using the code we presented); and, iii) do manual adjustments to refine gates according to the current experiment.Compensation: The workflow includes compensation as a preprocessing step which the authors have shown in separate publications to improve data quality, but which is currently not routinely performed by many researchers working using mass cytometry. I, therefore, assume that most users of this pipeline would be relying on published spillover matrices that reflect estimates of isotope purity and oxidation. While I agree with the usefulness of this function, I believe that adding additional quality control functions could improve acceptance of and trust in this approach. For example, in flow cytometry, overcompensation is often easily spotted by the occurrence of overly negative values, however, using their NNLS approach this is not readily apparent in compensated mass cytometry data. It would be very helpful to have a quality metric that would alert users to such potential issues introduced by the compensation step.These are all very good points and legitimate concerns. As indicated in the original paper, the spillover matrix used to compensate mass cytometry data should be calculated based on the antibodies included in the panel. We should stress here that, based on the single-stained bead acquisition approach presented in the original paper, the experimental procedure required to generate a compensation matrix is fairly straightforward and can be achieved rapidly. Using a previously published spillover matrix is a risky strategy, which can indeed lead to inaccurate compensation. The user should instead first run the compensation in classic mode and perform a visual inspection to ensure no overcompensation can be detected before using the NNLS method. This is a valuable option to avoid this specific type of artefact. Automating this step is a good suggestion, but is out of the scope of this publication and comes with some disadvantages. The risk we see is that this process could be imperfect and potentially misleading for the user. Indeed, such an approach would only identify overcompensation in channels where a single positive population is present but not in the case of a double positive population. In other words, it will highly depend on the user’s data type. Furthermore, it would not identify under-compensated signals. As a consequence, providing an approach to alert users of potential issues would likely be imperfect and could give a wrong impression that the data are correctly compensated if no alert is raised, which is not necessarily the case. Moreover, to the best of our knowledge, such an approach also doesn’t exist in fluorescent flow cytometry, most likely due to the fact that ensuring accurate compensation on a fully stained data set is a challenging task. We should also mention that the spillover coefficients in mass cytometry rarely exceed 4% and therefore the consequences of a slight over or under-compensation are less important in mass cytometry than in flow cytometry.While this might only be needed in rare cases, a function to rename channels and potentially match these names across multiple fcs files could enhance the adaptability of this package. In my test case, conflicting channel names prevented the import of the files into the workflow. In other cases, it might help to match channel names between batches. The authors could look to the premessa package for inspiration.We very much appreciate this comment as we have encountered various discrepancies between panels, especially in long-term projects. To date, we have used a custom R script to i) read in files separately; ii) fix panels according to a reference file (i.e., removing/adding additional/missing channels); and, iii) write out a new set of FCS files with concordant panels.However, this solution is suboptimal as it leads to a duplication of files (or, in case the original files were overwritten, the process would no longer be reproducible). Similarly, a GUI solution (as ‘premessa’) would defeat the purpose of providing an automated, reproducible preprocessing solution. Thus, taken together, we propose (and have now implemented) the following strategy:> ‘prepData’ now exposes additional arguments to be passed to ‘flowCore::read.FCS’ via ‘...’> ‘read.FCS’ provides an argument ‘channel_alias’: “an optional ‘data.frame’ used to provide the alias of the channels to standardize and solve the discrepancy across FCS files. [...]”> independent of whether or not this option is used, ‘prepData’ will check whether panels (FCS channel names) match between files:in case of any discrepancy, the newly added ‘fix_chs’ argumentwill be used to determine how to resolve discrepancies“all” will keep all channels (i.e., the union across files);any missing channels will be added to the respective samples,and a channels x samples matrix is stored in the object to trackwhich channels were present in which samples originally“common” will keep shared channels (i.e., the intersection across files); any other channels will be dropped from the respective files‘prepData’ will, in any case, return a ‘SingleCellExperiment’, i.e.,no altered FCS files or ‘flowFrame’s will be written out / returnedThe authors have incorporated various options for DNA channels which is much appreciated. My test data had been stained with a rhodium intercalator. Specifying this worked well, only the res$scatter function seems to ignore this choice and instead seems to default to iridium DNA intercalators.We thank the reviewer for noticing this. Indeed, while the workflow allows for specification of the DNA channels used (via variable ‘dna’), these were fixed internally in CATALYST’s ‘normCytof()’ function. We have added an additional argument to allow passing custom DNA channel masses (default ‘dna = c(191, 193)’ for Ir191/3; for Rh103, the argument would be ‘dna = 103’ instead); the output scatter plot of DNA vs. bead intensities (‘res$scatter’) is now generated based on the first matching DNA channel (see updated ‘?normCytof’ documentation).Sample specific debarcoding is appreciated. Figure 6 and the plotYields function return a debarcoding percentage. I believe this percentage refers to percent of initial assignments, but it is not specifically stated. It might be helpful to get a feeling of the percentage of cells (out of total cells) that are assigned after refining the initial assignments.That is correct: As in the original Finck et al. outputs (a MATLAB application), yields (left-hand y-axis) correspond to the proportion of cells that would be retained upon applying a given cutoff (x-axis). In Figure 8, we compare the absolute barcode population sizes before vs. after debarcoding. Analogously, it would be straightforward for users to generate such a barplot from cell counts obtained when applying various thresholding schemes (e.g., no filtering compared to global vs. sample-specific separation cutoffs).The manuscript is presenting an updated version of the CATALYST package for preprocessing Cytof data. It is well detailed with several examples and has been updated based on the Bioconductor SingleCellExperiment class. Every step of preprocessing is clearly stated and illustrated to guide the user on the different steps to process their data. Also, new quality checks are being reviewed to explore the quality of the data.I provided below some feedback to make the manuscript clearer and some suggestions to address some issues I encountered:It will be useful to define clearly what are the differences between successive acquisitions, single CyTOF run and batch.The different samples and runs listed through the different examples could be better presented with a table containing all runs and samples. In the data description, it explains that "The dataset used in this study was obtained from a single CyTOF run containing nine references, three blood samples and three tumor samples barcoded with a 20-well barcoding plate". However in the quality checks section; additional data is being analyzed which makes it confusing, coming from additional runs, sometimes from 7 runs or other times from 8 runs.Batch alignment:Could you provide an additional plot showing the effect of applying this correction factor?How are you assessing the performance of your batch alignment method?argument norm_to in the normCytof() function: give explanations on how it being computed when giving reference beads, especially how does it compute the new baseline, does it takes into account both the beads from the reference and current by averaging both?Can it be used to normalize data from different batches? If so how does it deal with distinguishing times and ordering the beads and time which would be similar in separated batches?Figure 4: Could you please give more explanations on how to assess run sensitivity and how does the user decide what is acceptable and what is not?Also, you need to load the library(reshape2) to run this part.The wrap_plots function is missing here.I got an error when running the QC on reference cell counts. "Error: Can't combine `1$CellLine_R1` and `2$CellLine_R1` ."Minor comments:When running the code using the data provided, the directory name should be modified to "CyTOF_acquisition_1-3.FCS/" instead of datafcs <- list.files("CyTOF_acquisition_1-3.FCS/", "acquisition", full.names = TRUE)Also, it should be specified that the directory name containing all the data should be called "data" and it refers to the directory name, or an alternative is to have the local directory "." instead of data like in here:# specify path to reference beadsref_beads <- file.path(".", "normalization_beads.fcs")Introduction:"an important step is to correct for batch effects, which can be achieved by including a shared control sample in each independent batch
" Add CytofRUV reference mentioned in the discussion."In our example, barcode identifiers include each sample’s type (CellLine, PBMC or Tumor), group (R for reference or S for sample of interest), and replicate number; and follow a consistent naming scheme: We can easily extract these components and store them in the SCE’s cell metadata (colData)". The example selected is not the best one, as it not showing any differences between the 6 first row.There is a typo in the legend of some figures like figure 17: "previously" acquired runs.Is the rationale for developing the new method (or application) clearly explained?YesIs the description of the method technically sound?YesAre the conclusions about the method and its performance adequately supported by the findings presented in the article?YesIf any results are presented, are all the source data underlying the results available to ensure full reproducibility?YesAre sufficient details provided to allow replication of the method development and its use by others?PartlyReviewer Expertise:I work in Bioinformatics and especially in the normalization and batch correction of CyTOF datasets.I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.Minor comments:It will be useful to define clearly what are the differences between successive acquisitions, single CyTOF run and batch.Indeed, the meaning behind the concepts of successive acquisitions, single CyTOF run and batch was not fully clear and these terms were not used in a consistent way. A “CyTOF run” corresponds to an independent experiment where samples are stained and acquired simultaneously on the CyTOF. We replaced the term run with experiment to clarify the meaning. Each CyTOF experiment corresponds to one “batch” and this term is used to refer to the batch correction which is performed on the different CyTOF experiments. The data from a single CyTOF experiment are usually acquired over multiple “successive acquisitions”, each leading to the generation of a single FCS file. We also made the use of these terms consistent throughout the paper.The different samples and runs listed through the different examples could be better presented with a table containing all runs and samples. In the data description, it explains that "The dataset used in this study was obtained from a single CyTOF run containing nine references, three blood samples and three tumor samples barcoded with a 20-well barcoding plate". However in the quality checks section; additional data is being analyzed which makes it confusing, coming from additional runs, sometimes from 7 runs or other times from 8 runs.The pipeline described in this paper was designed to preprocess CyTOF data acquired over a long period of time with a focus on ensuring data consistency over time. The aim of the workflow is to guide the readers through the preprocessing steps required to convert FCS files obtained in a given CyTOF experiment to a format suitable for downstream analysis, while presenting key quality checks to ensure the reliability of the data generated in the experiment of interest. Therefore, the whole analysis is based on a dataset obtained from a single CyTOF experiment, which is benchmarked against data acquired during a preparatory phase. For consistency reasons, we included now systematically the data from seven previous CyTOF experiments to benchmark the data of the CyTOF experiment preprocessed in this paper.Batch alignment: Could you provide an additional plot showing the effect of applying this correction factor? How are you assessing the performance of your batch alignment method?The batch alignment presented in this paper is based on a linear scaling based on a percentile, using references as anchoring points, similar to a previously published method (Schuyler et al, 2019). To assess the performance of our batch alignment method, we have now included a figure to compare the expression distributions before and after batch correction (including their 98th percentiles and those of the references). As intended, 98th percentiles align with the references’ upon correction, while expression distributions remain virtually unchanged.argument norm_to in the normCytof() function: give explanations on how it being computed when giving reference beads, especially how does it compute the new baseline, does it take into account both the beads from the reference and current by averaging both?Normalization using reference beads follows the methodology originally introduced in Finck et al. (2013): The baseline is computed as the mean intensity of the reference beads only, not including the current experiment. Would the average be taken over both, intensities would not be aligned between current and reference experiment. While the statement “[...] We provide the path to a set of reference beads (argument `norm_to`) that are used to compute baseline intensities for normalization” explains this only briefly, we believe that the method is well established and readers should refer to the original publication for more detail.Can it be used to normalize data from different batches? If so how does it deal with distinguishing times and ordering the beads and time which would be similar in separate batches?Yes, certainly. The normalization aims at correcting the signal time-drift due to progressive loss of sensitivity during acquisition. This is a technical effect that is independent of batch effects, and should be accounted for regardless of whether or not batch effects are present: these should be corrected for downstream analysis.Events from different FCS files (independent of whether these are different acquisitions of the same experiment or batches) are concatenated. How event times are dealt with depends on prepData()’s input arguments. When by_time = TRUE, files are ordered according to their acquisition time (stored under each flowFrame’s $BTIM description field). Otherwise, they are kept in the order provided by the input metadata table (argument md).Figure 4: Could you please give more explanations on how to assess run sensitivity and how does the user decide what is acceptable and what is not?Instrument sensitivity is an important parameter that should be closely monitored. This parameter is assessed during the tuning but those data cannot be easily exported and compared between experiments. The aim was to take advantage of the beads, which are run together with the samples to report on instrument sensitivity. Figure 3 provides key information regarding how the sensitivity evolves during the run, while the point of Figure 4 is to show how the average sensitivity evolves from one experiment to another. Instrument sensitivity varies from machine to machine and deciding what is acceptable will depend on the requirements of the users. The point of this plot was to offer an option for the user to easily identify in case the sensitivity is getting low compared to previous experiments, and to make a link between the quality of the data generated in a specific experiment with the sensitivity of the instrument.Also, you need to load the library(reshape2) to run this part.Yes, thank you for catching this; we’ve added reshape2 to the list of dependencies, and it is now loaded along the other required libraries.The wrap_plots function is missing here.Yes, thank you for catching this; we’ve added patchwork to the list of dependencies, and it is now loaded along the other required libraries.I got an error when running the QC on reference cell counts. "Error: Can't combine `1$CellLine_R1` and `2$CellLine_R1`."True, thank you; I could reproduce this with the current R and package versions. It has been fixed by converting the ‘run’ object of class ‘table’ to call ‘integer’ using c().When running the code using the data provided, the directory name should be modified to "CyTOF_acquisition_1-3.FCS/" instead of data:fcs <- list.files("CyTOF_acquisition_1-3.FCS/", "acquisition", full.names = TRUE)We are not sure we understand this comment. ‘list.files(path, pattern, …)’ expects the first argument to be a directory (where the FCS files are located), not the file names themselves (“xxx.FCS”).Also, it should be specified that the directory name containing all the data should be called "data" and it refers to the directory name, or an alternative is to have the local directory "." instead of data like in here:# specify path to reference beadsref_beads <- file.path(".", "normalization_beads.fcs")Thank you, yes, we forgot to mention that in the presented code all data used throughout the workflow is expected to sit inside a “data” subdirectory relative to where the .Rmd file is being run. We have now added a note explaining this in the 2nd paragraph under “Results”.Introduction: "an important step is to correct for batch effects, which can be achieved by including a shared control sample in each independent batch" Add CytofRUV reference mentioned in the discussion.We updated the reference to CytofRUV to the new version of the manuscript published in eLife and added it to the introduction."In our example, barcode identifiers include each sample’s type (CellLine, PBMC or Tumor), group (R for reference or S for sample of interest), and replicate number; and follow a consistent naming scheme: We can easily extract these components and store them in the SCE’s cell metadata (colData)". The example selected is not the best one, as it not showing any differences between the 6 first row.True. We have modified the example to sample 10 unique ‘sample’ entries (= type_group) for which to display the ‘colData’.There is a typo in the legend of some figures like figure 17: "previously" acquired runs.Thanks for pointing out this typo, which was corrected in the corresponding figures.
Authors: Anja Irmisch; Ximena Bonilla; Stéphane Chevrier; Kjong-Van Lehmann; Franziska Singer; Nora C Toussaint; Cinzia Esposito; Julien Mena; Emanuela S Milani; Ruben Casanova; Daniel J Stekhoven; Rebekka Wegmann; Francis Jacob; Bettina Sobottka; Sandra Goetze; Jack Kuipers; Jacobo Sarabia Del Castillo; Michael Prummer; Mustafa A Tuncel; Ulrike Menzel; Andrea Jacobs; Stefanie Engler; Sujana Sivapatham; Anja L Frei; Gabriele Gut; Joanna Ficek; Nicola Miglino; Rudolf Aebersold; Marina Bacac; Niko Beerenwinkel; Christian Beisel; Bernd Bodenmiller; Reinhard Dummer; Viola Heinzelmann-Schwarz; Viktor H Koelzer; Markus G Manz; Holger Moch; Lucas Pelkmans; Berend Snijder; Alexandre P A Theocharides; Markus Tolnay; Andreas Wicki; Bernd Wollscheid; Gunnar Rätsch; Mitchell P Levesque Journal: Cancer Cell Date: 2021-01-21 Impact factor: 31.743
Authors: Bernd Bodenmiller; Eli R Zunder; Rachel Finck; Tiffany J Chen; Erica S Savig; Robert V Bruggner; Erin F Simonds; Sean C Bendall; Karen Sachs; Peter O Krutzik; Garry P Nolan Journal: Nat Biotechnol Date: 2012-09 Impact factor: 54.908
Authors: Kipper Fletez-Brant; Josef Špidlen; Ryan R Brinkman; Mario Roederer; Pratip K Chattopadhyay Journal: Cytometry A Date: 2016-03-18 Impact factor: 4.355
Authors: Subarna Palit; Christoph Heuser; Gustavo P de Almeida; Fabian J Theis; Christina E Zielinski Journal: Front Immunol Date: 2019-07-03 Impact factor: 7.561
Authors: Felix J Hartmann; Joel Babdor; Pier Federico Gherardini; El-Ad D Amir; Kyle Jones; Bita Sahaf; Diana M Marquez; Peter Krutzik; Erika O'Donnell; Natalia Sigal; Holden T Maecker; Everett Meyer; Matthew H Spitzer; Sean C Bendall Journal: Cell Rep Date: 2019-07-16 Impact factor: 9.423
Authors: Greg Finak; Jacob Frelinger; Wenxin Jiang; Evan W Newell; John Ramey; Mark M Davis; Spyros A Kalams; Stephen C De Rosa; Raphael Gottardo Journal: PLoS Comput Biol Date: 2014-08-28 Impact factor: 4.475
Authors: Xiaoying Zhou; Wong Yu; Diane M Dunham; Jackson P Schuetz; Catherine A Blish; Rosemarie H DeKruyff; Kari C Nadeau Journal: J Clin Invest Date: 2022-10-17 Impact factor: 19.456