Pratyaydipta Rudra1, Ryan Baxter2, Elena W Y Hsieh2,3, Debashis Ghosh4. 1. Department of Statistics, Oklahoms State University, Stillwater, OK 74078, USA. 2. Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA. 3. Department of Pediatrics, Section of Allergy and Immunology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA. 4. Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.
Abstract
MOTIVATION: Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. RESULTS: We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. AVAILABILITY AND IMPLEMENTATION: CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. CONTACT: prudra@okstate.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
MOTIVATION: Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. RESULTS: We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. AVAILABILITY AND IMPLEMENTATION: CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. CONTACT: prudra@okstate.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
Recent developments in single-cell-based technologies, such as mass cytometry (i.e. cytometry by time-of-flight, CyTOF), have led to the need for computational and analytic approaches that can accommodate the high dimensionality and single-cell granularity. The analysis of CyTOF data can elucidate novel disease biomarkers and mechanisms of the underlying immunopathology, leading to improved treatments and prognostic measures.Mass cytometry allows the simultaneous detection of more than 40 proteins per cell in hundreds of thousands of cells per sample (Bendall ; Saeys ). The data are often clustered into cell subpopulations first, which can then be used to answer scientific questions regarding the abundance of cell types and expressions of specific parameters (e.g. activation markers, signaling proteins, cytokines) across groups, such as disease and control groups, pre- and post-treatment groups, or samples that are stimulated or not. There have been significant research on clustering procedures with these high-dimensional datasets [see Aghaeepour and Weber and Robinson (2016) for a review]. We will focus on the downstream statistical analysis after the clustering has been performed. The statistical questions about the tree-structured cell population data (e.g. Fig. 1) can be visualized in two layers. First, it is clinically interesting to know if the abundance of the cell subpopulations is different across two or more groups and/or conditions. Given the proportion of cell types for each sample, the next question is whether there is any differential expression of activation markers, signaling proteins or cytokines (functional measurements of the cell populations studied). The latter is also known as ‘cell state’ analysis while the former is called ‘cell type’ analysis (Weber ). In this paper, we focus on the analysis of cell type.
Fig. 1.
Hierarchical tree structure of the cell types from the SLE study 2. See Supplementary Materials for a full list of the cell subpopulations. It is of interest to (i) test if the compositional profile of the cell types is associated with the disease groups, i.e. if there is differential abundance of any of the cell types between SLE patients and controls; and (ii) if yes, which cell types contribute the most to the association
Hierarchical tree structure of the cell types from the SLE study 2. See Supplementary Materials for a full list of the cell subpopulations. It is of interest to (i) test if the compositional profile of the cell types is associated with the disease groups, i.e. if there is differential abundance of any of the cell types between SLE patients and controls; and (ii) if yes, which cell types contribute the most to the associationWhile there are a variety of methods that test differential cell type abundance (Arvaniti and Claassen, 2017; Bruggner ; Lun ; Weber ), several of them suffer from limitations such asDifficulty of interpretation due to not clearly distinguishing between cell type (abundance/frequency) and cell state (activation, function) questions.No way of testing statistical significance for individual features (cell types).Computational scalability.See Weber for a more detailed discussion on these limitations. The two approaches that overcome these limitations are (i) a GLMM-based approach using logistic mixed model (Nowicka ) and (ii) the diffcyt method by Weber . These approaches effectively perform mass univariate analyses for each cell type. This ignores correlations between cell types as well as increases the multiple testing burden. In this paper, we propose a statistical framework that can test for the association of the multidimensional cell-type profile (or, ‘cell-type abundance’, to be used interchangeably) with the predictor variables (e.g. disease groups). Starting the analysis with the test of this global null hypothesis avoids the problem of multiple comparisons and also accounts for the correlation structure present in the data.This multivariate approach can often help us better understand the biological functions of immune cell types. In the context of understanding disturbances to normal immune phenotype and function (i.e. a disease process, iatrogenic intervention), the interdependent relations between different cell types and their function need to be addressed collectively. If we evaluated statistical significance for changes in each single-cell population, independent of the frequency changes in other cellular populations, it would not accurately depict the immunopathology, as the immune system is dynamic and changes in one compartment directly (or indirectly) affect the other. This is also the reason why ‘biomarker panels’ are currently used instead of single disease biomarkers, since single parameters do not accurately prognosticate disease progression or response to therapy and cannot account for the underlying pathology (no matter how significant that single parameter is). Lastly, the field of multiomic and systems immunology is accelerating our understanding of disease processes because of the capability to assess multiple processes simultaneously, depicting a closer view of the disease process. Hence, statistical analyses must account for this multiplexed capability as opposed to addressing single differences.
1.1 Motivating data
1.1.1 Study 1
This example originates from a previous CyTOF study conducted by O’Gorman to understand single-cell phenotypic and functional characterization of pediatric Systemic Lupus Erythematosus (SLE) patients and healthy controls. SLE patients with presentation prior to age 18-years old and age and gender-matched controls were recruited for the study. Peripheral blood was collected at initial diagnosis prior to the initiation of any treatment. Blood samples were fixed either immediately after collection (T0); or after incubation at 37°C with a protein transport inhibitor cocktail for 6 h (T6). The final dataset after all filtering steps contained 28 single-cell data files (7 patients and 7 controls, and 2 stimulation conditions, T0 and T6, per subject). Single-cell data for every subject were manually gated to obtain the hierarchical clustered cell type data. It is of interest to test if the cell-type compositions are different (i) across the disease groups (ii) across the stimulation conditions.
1.1.2 Study 2
A second study comparing SLE patients and healthy controls is currently being conducted by the authors with a larger number of participants (see Supplementary Section S5, for a partial analysis of this data using our method). The hierarchical clustering structure for this study is shown in Figure 1. Due to the hierarchical nature of the data, cell-type abundance can be defined in many different ways by choosing different parent and children nodes of the tree. For example, it might be of interest to consider the abundance of all cell types from the terminal nodes as a fraction of the lymphocytes, but one may also want to test, e.g. the abundance of the B-cell subpopulations as a fraction of the B-cells. Our proposed multivariate approach can be used to conduct the test to answer each of these questions separately.
1.2 Statistical challenge
Data on cell-type abundance is compositional by nature, i.e. the sum of the cell-type proportions add up to one. In mathematical notation, if denotes the cell-type abundance of the q cell types, then
where is the q-dimensional simplex (Aitchison, 1982). Due to this, the classical statistical models for non-compositional data are not appropriate for testing differential abundance in mass cytometry.Although there is a rich literature of statistical methods for compositional data analysis (Pawlowsky-Glahn ), most of the traditional methods of compositional data analysis (often based on multinomial or Dirichlet distributions) test the association of the predictors with the individual components one at a time. A multivariate approach accounting for the correlation among the components is expected to perform better and have higher statistical power due to a decreased burden of multiple testing. The advantage of using such multivariate approaches for genomic association studies is well documented (e.g. Broadaway ; Wu ). The authors are unaware of an appropriate multivariate approach to test association in high-dimensional cell-type abundance data arising from mass cytometry.The traditional generalized linear models cannot be used here due to the presence of overdispersion typical for these data. The state-of-the-art methods to analyze differential abundance for mass cytometry data are based on classical generalized linear mixed models (GLMM; Nowicka ) with the help of ‘observation-level random effects’ (OLRE), or based on newer developments such as edgeR, limma or voom (Law ; Ritchie ; Robinson ). The recently developed diffcyt methods (Weber ) using the above approaches are shown to perform well for mass cytometry data, but they have the same limitation of approaching the problem in an univariate manner. Also, it has been shown that the statistical tests based on the GLMM approach often leads to anticonservative results, especially in small samples (Bolker ; Forstmeier ; Silk ) which is typical for clinical studies using CyTOF data. The newer methods such as edgeR or voom do not provide theoretical guarantee of type-I error control either. These methods have been shown to have inflated type-I error rate when applied to other types of data (Datta and Nettleton, 2014; Hawinkel ; Rocke ; Vestal ). In this paper, we propose a multivariate statistical framework, ‘Compositional Data Analysis using Kernels’ (CODAK), based on kernel distance covariance (KDC; Hua and Ghosh, 2015) to quantify and test the association of predictors such as grouping or application of drugs with the composition profile of cell types. This association test can often be used as the test of differential abundance, but we present the method in general so that it can be used for more general cases (e.g. continuous predictor) besides the two-group comparison. We also propose two approaches of covariate adjustment under this framework and suggest some follow-up methods to understand which components of the composition are most responsible for the association. We illustrate the performance of our methods using extensive simulation studies and also apply it to high-dimensional mass cytometry dataset we collected on SLE patients and healthy control subjects. Analysis of the data revealed clinically relevant patterns such as differential cell type abundance between the disease and the control group.
2 Materials and methods
2.1 The kernel distance covariance framework
Distance covariance/correlation is a method to quantify and test for association between random variables of arbitrary dimensions (Székely , 2009). It is powerful against any form of lack of independence. The distance covariance approach is closely related to the kernel-based approaches using Hilbert Schmidt Independence Criterion (HSIC; Gretton ). The equivalence of the two approaches was shown by Sejdinovic and Shen and Vogelstein (2020). Hua and Ghosh (2015) discussed the equivalence in the context of genetic association studies and introduced the term ‘kernel distance covariance’ (KDC).For n measurements on two multidimensional random variables and , let us denote the observation from ith sample unit as . Define the matrices and as
where k and l are the appropriate kernel functions measuring the similarity of pairs of observations. The KDC statistic is defined as
where is the centering matrix, I being the identity matrix of dimension n and being the vector with each element equal to 1.For our application of the KDC approach, suppose we have a (potentially multivariate) predictor and that the cell-type abundance is given by , where . A key aspect of the KDC approach is the choice of the kernels k and l. Some common choices of kernels in association studies are the linear kernel, polynomial kernel and Gaussian kernel (Schölkopf ). However, these do not directly apply to cell-type abundance compositional data since the data belongs to a simplex. For CODAK, we propose a kernel using Aitchison distance (Aitchison, 1982) as an appropriate kernel for measuring similarity between two compositions. Let the cell-type composition for the ith sample be . Then the similarity between the composition profiles of the ith and the jth sample can be given by (Gower, 1966)
where the element of the matrix D2 is , square of the Aitchison distance (AD) between and . The AD is defined by
where g(P) is the geometric mean of . Alternatively, one can use a Gaussian kernel (Schölkopf ) with AD as
where is the AD between and and γ is a tuning parameter which is often chosen as the median distance. Our empirical results show that the performance of these two kernels [Equations (3) and (6)] are nearly identical (Supplementary Fig. S6).Note that AD is same as the Euclidean distance calculated after applying the centered log-ratio (clr) transformation (discussed later in section) to the compositions. The AD has some desirable properties such as scale invariance, permutation invariance, perturbation invariance and subcompositional dominance [see Martín-Fernández for a detailed discussion]. For example, scale invariance ensures that rescaling each composition does not change the distance, and subcompositional dominance ensures that the distance between two subcompositions can only be smaller than the distance between the original compositions. Neither of these two properties is satisfied by the Euclidean distance (Martín-Fernández ). These properties are important for using distance-based methods, and failing to ensure them can result in reduced statistical power (Gloor ; Pawlowsky-Glahn and Buccianti, 2011). We have also demonstrated this in our simulation studies (Section 3).Another commonly used distance for compositional data is the Bray–Curtis (BC) distance (Bray and Curtis, 1957). Therefore, we have also implemented a kernel based on the BC distance by replacing the AD in Equation (3) by the BC distance defined asHowever, the BC distance is not a proper distance (Greenacre and Primicerio, 2014) and therefore the kernel defined based on it may not be positive semidefinite. It is a common practice to modify the BC kernel matrix by changing the negative eigen values to 0 (Zhao ), but this may lead to inferior results.In some other fields such as microbiome data analysis where the Operational Taxonomic Units (OTU) are related by a phylogenetic tree, it is important to account for the hierarchical structure of the phylogenetic tree to capture the degree of evolutionary divergence between neighboring bacterial groups. Different versions of Unifrac distance have been considered in the microbiome data analysis literature for this purpose (Chen ; Lozupone and Knight, 2005; Lozupone ; Wong ). Other approaches to account for hierarchical structures have been studied by Silverman , Wang and Zhao (2017a, 2017b) and Wang . While it is possible to extend CODAK to incorporate distances that account for a hierarchical tree structure, for instance, using weighted Unifrac (Lozupone ) or generalized Unifrac distance (Chen ), doing so is less relevant for CyTOF as there is no natural equivalent of phylogenetic distance. Therefore, we have not used such a distance in favor of the simplicity of interpretation and follow-up analysis for the immunologist.For most applications, is univariate and for the remainder of the paper we treat it as such. We use a linear kernel for continuous predictors defined by
and a kernel based on Hamming distance for binary predictors given bySince the Hamming distance is a distance metric, we can use standard arguments (Sejdinovic ; Shen and Vogelstein, 2020) to show that (9) is a proper kernel. We use a permutation method to obtain the null distribution of the KDC statistic. We have also explored the performance of a Gaussian kernel for X and found it to have relatively worse statistical power (Supplementary Fig. S7).The intuition for the KDC approach for the motivating data (Section 1.1.1) is demonstrated in Figure 2. The figure shows the densities for the kernel similarity measures when comparing the two disease groups (first panel) and when comparing the two stimulation conditions (second panel). The red curve shows the within-group (or condition) similarity and the blue curve shows between-group (or condition) similarity. It is often the case that the cell type compositions are different in two disease groups (SLE and healthy control). Loosely speaking, this is same as saying that the similarity in the compositional profiles between observations from the same disease group is likely to be more compared to that between observations from different disease groups. This is reflected in the first panel where the red ‘same group’ curve is located to the right of the blue ‘different group’ curve. On the other hand, the compositions are less likely to vary across stimulation conditions, which is why the curves in the second panel are more similar.
Fig. 2.
Motivation of the KDC approach for the first SLE study: the densities for the kernel similarity measures are plotted when comparing the two disease groups (first panel) and when comparing the two stimulation conditions. The red curve shows the similarity of observations within the same group (conditions) and the blue curve shows the similarity of observations between groups (conditions)
Motivation of the KDC approach for the first SLE study: the densities for the kernel similarity measures are plotted when comparing the two disease groups (first panel) and when comparing the two stimulation conditions. The red curve shows the similarity of observations within the same group (conditions) and the blue curve shows the similarity of observations between groups (conditions)
2.2 Adjusting for covariates
Our framework CODAK allows for covariate adjustment using two different approaches. Suppose we have a set of covariates . Denote the value of Z from the ith subject as . In the first approach, we use the additive log-ratio transformation (alr) commonly used in compositional data analysis (Aitchison, 1982; Aitchison ; Pawlowsky-Glahn ) to transform the data to , perform covariate adjustment to compute the residuals using linear regression, and transform the residuals back to the simplex using the inverse alr-transformation: . Here, H is the hat matrix obtained from the design matrix of Z and I is the identity matrix. We can then also compute the residuals by regressing X on Z, and use CODAK on P(Z) and X(Z). The alr-transformation and the inverse alr-transformation for a single compositional observation are given by
andHere, is the closure operator which divides each component of the vector by the sum to ensure the constant sum 1 of the resulting vector. We can skip this technicality of converting to a composition such that due to the scale invariance property of the AD. Note, however, that the alr-transformation is asymmetric in the components, and the alr-coordinates are non-isometric in nature (Pawlowsky-Glahn and Buccianti, 2011). Alternatively, one may use the centered log-ratio (clr) transformation and its inverse given by
and
where g(P) is the geometric mean of . The clr-transformed vector Y is symmetrical in the components, but belongs to a subset of due to the constraint . One can obtain an orthonormal coordinate system based on the clr-transformed vector, but it is not necessary for our purpose, and it can be shown that alr and clr have identical result for our covariate adjustment (see Supplementary Section S7 for a simple proof). We present the results using the alr-based approach (CODAK-alr) in Section 3.A second approach, motivated by Zhan , is developed when the covariates are categorical. This approach is based on the idea of stratified kernel, which, in our context, is defined asThe kernel l is defined similarly. The method essentially considers two observations i and j to have (potentially) non-zero similarity only if they are from the same stratum, where the strata are defined by the values of the categorical covariates. We can then proceed to use CODAK using Equation (2). It can be shown that the stratified kernel defined in this manner is strictly positive definite (Park ; Zhan ). It should be noted that this method reduces the effective sample size by only considering similarity within strata and therefore may lose power, especially in the presence of multiple categorical covariates. However, in many applications (e.g. the motivating problem), we only have one categorical covariate. The stratified kernel approach (CODAK-sk) can be effective in such cases, and has been empirically shown to be robust against slight violation of the independence of sample observations (Section 3.3).One important consideration for permutation test in the presence of covariates is the choice of permutation method. Kennedy and Cade (1996) suggested residualizing both Y and X on Z and using a permutation test for the association between Y(Z) and X(Z) as discussed above. However, several studies have found the Kennedy and Cade method to be anticonservative (Anderson and Legendre, 1999; Winkler ) for linear and generalized linear models. These studies found that the alternative approach of Freedman and Lane (1983) achieves better control of type-I error. We have explored both the Kennedy and Cade (KC) method and the Freedman and Lane (FL) method here.
2.3 A measure of effect size
A measure of estimated effect size can be obtained by using the idea of distance correlation (Székely ). Following Sejdinovic , the distance correlation (dcor) between the variables P and X, in our context, can be defined asThis is equivalent to the squared distance correlation from Székely if both K and L are L2-distance kernels (Hua and Ghosh, 2015). Since the L2-distance is not the appropriate distance for compositional data for reasons discussed earlier, the ‘original’ distance correlation (Székely ) should not be used in this case. We include some results on application of the original distance correlation method for this type of data in Supplementary Figure S8.
2.4 Follow-up methods for individual components
One criticism of multivariate approaches is the difficulty of interpretation for individual components of the composition profile. For example, as a follow up to differentially abundant cell-type compositions, it is often of interest to understand the cell types that are the top contributors to the differential abundance. While one can use the traditional effect sizes such as odds ratio or average difference in proportions for each cell type, we provide two kernel-based solutions here.
2.4.1 Leave-one-out approach
One intuition is that if a component c of the composition contributes to the association of the compositional profile P with a predictor X, then dropping that component from the compositional profile should lead to a decrease in the distance correlation. Therefore, we can compute the following leave-one-out (LOO) statistic for every component and rank them in order of their values:
where . It is obvious from Equation (4) that the AD between two compositions and reduces when a component is excluded (subcompositional dominance). Further, it can be shown (see Supplementary Section S6 for a simple proof) that the reduction is maximized when the component c with the value farthest from the mean is excluded, i.e.This further strengthens the above intuition since it is clear that the component with the highest true ratio of abundance in the two-groups contributes the most to the AD between the composition profiles of observations in different groups. Therefore, we propose picking the top cell-type candidates based on the ranking of this LOO statistic defined in Equation (17).
2.4.2 Weighted distance correlation approach
A second approach is motivated by Wen and based on weighted Aitchison distance (Egozcue and Pawlowsky-Glahn, 2016). Following Wen , we define weighted distance correlation between P and X by modifying the kernel k in Equation (6) to use the weighted AD instead of , where the weighted AD is defined as
where and . A set of weights that maximize the weighted distance correlation between P and X can then be obtained. Following Wen , we simplify the optimization problem by defining
and by subsequently optimizing for , where β is the distance correlation between P and X. Larger values of γ make the weights of the components with the strongest association with X contribute more to the weighted AD and subsequently to the KDC statistic. These weights can then be ranked in order of magnitude to indicate the top cell-type candidates contributing to the association.
2.5 Implementation
CODAK is implemented using statistical software package R, and the codes are available at http://github.com/GhoshLab/CODAK/. The time taken on a single computer to run CODAK for a mass cytometry data of usual size ranges from less than a second to ∼90 s depending on number of permutations used for the test (103 to 106). A comparison of the computation time of CODAK with commonly used methods for CyTOF cell-type abundance testing is provided in Supplementary Table S1.
2.6 Relationship between CODAK and other kernel/distance-based approaches
The kernel distance covariance method of testing has previously been shown to be closely related to some other approaches. The kernel machine regression (KMR) approach (Kwee ; Liu ) uses a semiparametric linear model where the relationship of the predictor variable X with the response Y is modeled through a function which is estimated from the data. The null hypothesis of interest is . The form of the function is determined by a user-specified kernel function and the test is often done using a score test. Hua and Ghosh (2015) showed that this test using the KMR approach is equivalent to the KDC approach when the kernels chosen for Y and X in the KDC approach are K and the linear kernel, respectively. The KMR approach has been used in many omics applications (e.g. Maity ; Wu ). One such application was developed for compositional data in a different set up by Zhao . They proposed MiRKAT, a test of association for microbiome abundance data. In their model, they used different forms of Unifrac distances (Lozupone and Knight, 2005) and BC distance (Bray and Curtis, 1957), but did not use AD. However, the MiRKAT method can in principle be used with AD.Another related approach is PERMANOVA, developed by McArdle and Anderson (2001), which tests the association between the two variables using a MANOVA-like F-statistic while obtaining the P-values using permutations. Although it was originally developed for the one-way ANOVA case, several extensions have been proposed to use quantitative variables and covariates (Anderson, 2014). PERMANOVA is a distance-based approach where the user can choose any appropriate distance including AD. Pan (2011) showed that the KMR approach (when using permutation test) and PERMANOVA are equivalent when there are no covariates. Therefore, the KDC approach with linear kernel for the predictor, the KMR approach and the PERMANOVA approach are all equivalent in the no covariate case, and CODAK can be thought of as either of these approaches applied to compositional data from mass cytometry through an appropriate AD kernel.However, these methods are not all equivalent when adjusting for covariates as we show using our simulation results. In particular, based on the discussion of Hua and Ghosh (2015), the KDC approach is equivalent to KMR if we only adjust the response variable Y for the presence of Z, the covariate. However, adjusting only Y and not X in a permutation test can lead to suboptimal results (Kennedy and Cade, 1996). Therefore, the covariate adjustment procedure of CODAK is not equivalent to PERMANOVA or MiRKAT, and we explore the relative performances of them in section.Comparison of statistical power for binary predictor adjusting for a binary covariate. The black dashed line in the first plot shows the nominal level α and the gray dashed line shows two times α. Only the methods with reasonable control of type-I error are shown in the other three plotsBoth PERMANOVA and MiRKAT provide an R2-like measure of effect size. The R2 of MiRKAT using AD is equivalent to the square of our dcor measure [Equation (15)]. The R2 of PERMANOVA is not equivalent to these (see Supplementary Section S2 for more discussion).
3 Results
3.1 Simulation studies
3.1.1 Description of the simulations
We conducted simulations to compare type-I error and power performance of the different methods for the following scenarios: (i) no covariate (binary or continuous predictor) (ii) with covariate adjustment (binary or continuous predictor) (iii) with covariate adjustment for repeated measures (binary predictor). We report the results for the binary predictor (i.e. two-group comparisons) here (Fig. 3) and the results for continuous predictor in Supplementary Figures S2 and S4. The third scenario was explored to understand the robustness of the methods for the violation of the independence assumption.
Fig. 3.
Comparison of statistical power for binary predictor adjusting for a binary covariate. The black dashed line in the first plot shows the nominal level α and the gray dashed line shows two times α. Only the methods with reasonable control of type-I error are shown in the other three plots
The ‘true’ effect sizes of multivariate parameter vectors (as used in the simulations) are difficult to illustrate using power plots. Instead, measures such as the true maximum log odds ratio (OR) or the true Aitchison distance can be used in place of effect size, which we explored separately in Figure 4. In order to demonstrate the comparison of statistical power, for each of the above scenarios, we considered four cases: (a) no association (null hypothesis) (b) all cell types have some small association with the predictor (e.g. small difference in abundance of every cell type across two groups), (c) 50% of the cell types have small associations with the predictor (d) 25% of the cell types have larger associations with the predictor. The ‘small association’ is defined as the 0.2 percentage point difference, and ‘larger difference’ as 0.4 percentage point difference between the two groups. For scenario (ii), the effect of the covariate Z is simulated similarly to the effect of the predictor in case (b). We used N = 10 000 simulations for each case and plotted the estimated size and power against different choices of the nominal level α.
Fig. 4.
Comparison of CODAK with GLMM and diffcyt-voom for various effect sizes. For every simulation scenario, the true Aitchison distance (AD) and true maximum log odds ratio are plotted. The colors represent the method with a higher statistical power for that scenario. It is evident that CODAK favors higher AD while the other methods favor strong effects for individual components. Scenarios AD > 1 or are not shown since all methods had perfect power in such cases
Comparison of CODAK with GLMM and diffcyt-voom for various effect sizes. For every simulation scenario, the true Aitchison distance (AD) and true maximum log odds ratio are plotted. The colors represent the method with a higher statistical power for that scenario. It is evident that CODAK favors higher AD while the other methods favor strong effects for individual components. Scenarios AD > 1 or are not shown since all methods had perfect power in such casesWe simulated data for subjects, i.e. n = 12 per group for the two-group comparison. Using a multinomial logit model, we generated 30 000–50 000 cells per individual sample with probability depending on the value of the predictor for a cell to be one of the 20 cell types. An OLRE term was included in the multinomial logit model to model the overdispersion present in real CyTOF datasets. The parameters, such as probabilities in the control group (ranging from 0.002 to 0.15), and the standard deviation of the random effects term () were chosen based on the estimates obtained from real data (Study 1.1.1). For scenario (iii), we used an additional subject-specific random effect term to induce the correlation in the repeated measures structure.
3.2 Competing methods
Besides using CODAK with the AD kernel, we included CODAK with Bray-Curtis distance (CODAK-BC) and Euclidean distance (CODAK-ED) for the purpose of comparison. We also compared the results using the original distance correlation (Székely ), but did not include it in Figures 3–6 since its performance was nearly identical to CODAK-ED (see Supplementary Fig. S8 for a comparison). We used 10 000 permutations for each CODAK method. To compare the performance of CODAK with the commonly used methods for testing cell type abundance in mass cytometry, we considered two GLMM-based methods and two methods from the diffcyt package. The GLMM-based methods were both logistic mixed effects models to test the association of the predictor with the abundance of each cell type (followed by adjustments for multiple testing using Bonferroni method), the difference among the two methods being the test used: Wald test or likelihood ratio test (LRT). The methods for diffcyt were also conducted per cell type (followed by adjustments for multiple testing) using either the edgeR or the voom method [as suggested by Weber ]. It should be noted that there is also a GLMM test option in the diffcyt package which is essentially same as the Wald test we considered. Since both the GLMM-based methods resulted in inflated type-I error, we also applied a resampling method using the LRT statistic where the LRT statistic was calculated using the logistic mixed model and a permutation method was used to obtain the P-value instead of using the asymptotic distribution of the LRT statistic. Due to high computational time, we only used this method for one of our simulation scenarios and we only reported the power/type-I error rate for .Comparison of statistical power for binary predictor. The black dashed line in the first plot shows the nominal level α and the gray dashed line shows two times α. Only the methods with reasonable control of type-I error are shown in the other three plots. The power for the LRT-permutation is shown for one choice of α due to the high computation timeComparison of statistical power for binary predictor adjusting for a binary covariate with repeated measures. The black dashed line in the first plot shows the nominal level α and the gray dashed line shows two times α. Only the methods with reasonable control of type-I error are shown in the other three plotsWhen adjusting for covariates, we included them directly in the model for the GLMM or diffcyt methods. We have also compared PERMANOVA (McArdle and Anderson, 2001) and MiRKAT (Zhao ) using AD since these methods are not equivalent to CODAK when adjusting for covariates. R-packages MiRKAT (Zheng ) and vegan (Oksanen ) were used. For scenario (iii) with repeated measures, when fitting the GLMM models, we used the known information on which observations are coming from the same individual subject to fit multilevel models accounting for the repeated measures structure. On the other hand, CODAK did not use this information.
3.3 Results from the simulation studies
The results for binary predictor (Fig. 3) shows that all the three CODAK methods controlled the type-I error at the target level, but none of the other methods could do so. The type-I error control of diffcyt-voom was near satisfactory and the LRT method among the GLMM methods lead to less inflation of the type-I error. Therefore, we choose one from each of the approaches of the competing methods and only show the diffcyt-voom and GLMM-LRT for the subsequent power comparisons. However, the comparison of power of GLMM-LRT and the other two methods are still unfair since the size of GLMM-LRT is approximately twice the as large as the target α level for the range of values of α considered. The power comparison shows that CODAK (using the default distance AD) performed better than the existing univariate methods in the case with small differences in all cell types (case b). With the reduction in the number of associated cell types, the relative advantage of CODAK compared to the univariate methods gradually diminished. The resampling-based LRT method controlled the type-I error at the nominal level, but failed to achieve high power. It is also extremely computationally intensive. CODAK using BC or Euclidean distance failed to provide satisfactory power. This clearly demonstrates that CODAK with AD has far superior performance compared to these distances. Hence, these distances were not used for the next simulation scenarios. Similar results were obtained for continuous predictor (see Supplementary Fig. S2).The results for binary predictor in the presence of a covariate (Fig. 3) lead to similar findings. CODAK-alr had slightly inflated type-I error here when we used the KC method of permutations (Kennedy and Cade, 1996), but it was comparable to diffcyt-voom. Using the FL method of permutations (Freedman and Lane, 1983) for CODAK-alr resulted in good control of type-I error, but slightly less power compared to the KC method. This is consistent with previous findings for linear models (Anderson and Legendre, 1999). CODAK-sk also controlled of type-I error at the nominal level. However, CODAK-sk had slightly inferior power performance compared to CODAK-alr. This is not unexpected, as discussed in section. Both PERMANOVA and MiRKAT appeared to be somewhat conservative and had less statistical power. The statistical power of MiRKAT was especially poor. It is likely to be due to the fact that it essentially adjusts the effect of the covariate only on the response variable. Such behavior of covariate adjustment methods has previously been discussed in the literature (Kennedy and Cade, 1996).The simulation results from scenario (iii) are shown in Figure 6. CODAK-sk achieved reasonable control of type-I error under slight misspecification of the independence assumption (subject-specific random effect variance = 0.1) and the power was also comparable to its performance in other cases. None of the other methods (including CODAK-alr) controlled the type-I error rate. We also verified that when the amount of dependence was increased, the control of type-I error by CODAK-sk gradually worsened (Supplementary Fig. S5).
Fig. 6.
Comparison of statistical power for binary predictor adjusting for a binary covariate with repeated measures. The black dashed line in the first plot shows the nominal level α and the gray dashed line shows two times α. Only the methods with reasonable control of type-I error are shown in the other three plots
To better understand the situations where CODAK (or other univariate methods) have advantage, we plotted the true AD and maximum log(OR) for each simulation [from scenario (i)] and color coded according to which method had higher power (Fig. 4). It is not unexpected that the GLMM and diffcyt-voom appeared to perform better than CODAK when maximum log(OR) was high but the AD was not (blue/green points). These were the cases where a smaller number of cell types are associated and the association is strong. CODAK performed better for effects that were relatively weaker, but more spread out across the cell types (red points). All methods performed well (black points) when there was strong association for several cell types, or very strong association for fewer cell types.Finally, we compared the ranking of the cell types for the different follow-up methods. For every simulation, we calculated the rank correlation of the true log(OR) with the rank of the cell types obtained using the LOO and the weighted dcor methods. The median of the rank correlations was 0.49 for the LOO method and 0.41 for the weighted dcor method. These are reasonably high considering that the rank correlation between estimated log(OR) and the true log(OR) was 0.42. The median number of ‘true’ top-5 cell types in the top-5 list provided by these methods was 3 for all the three methods [LOO, weighted dcor, and estimated log(OR)]. A histogram of the number of cell types (among top five) identified by the different methods is shown in Supplementary Figure S10, which shows that both LOO and weighted dcor performed close to the ‘true’ model using the log(OR), with the LOO method performing slightly better. Based on these results, we can conclude that the performance of the LOO-based follow-up method to rank the contribution of the individual cell types is satisfactory and should be preferred over the weighted dcor method.
3.4 Real data
In order to study the performance of CODAK on real data we applied it on the SLE data from Section 1.1.1. SLE is a systemic inflammatory disease in which multiple immunological derrangements have been described—Toll-like receptor signaling defects, over activation of neutrophil subsets, decreased regulatory T cell frequency due to T cell tolerance defects, excessive type-I interferon downstream activation, and autoantibody production by pathogenic B-cells, to name a few (Crow, 2014; Zharkova ). Therefore, one should expect that study of a single or a couple immune cell subsets in isolation would not fully depict the immunopathology. Recent publications have supported the multiimmune component etiology of the disease, and in fact, different disease phenotypes may be correlated with specific immune profiles that encompass the integrated immunological picture (Nehar-Belaid ). Therefore, our work analyzing multiple immune cell subsets and their downstream cytokine production, as a composite integrated approach, most accurately addresses the basic science and clinical questions.Each of the datasets described in Section 1.1 were generated via Fluidigm mass cytometer instruments. Resultant FCS files were bead normalized to account for instrument calibration. The final cell-type abundance proportion matrix after the data filtering steps (debarcoding, bead normalization, batch adjustment/normalization) had 28 rows and 12 columns, where each row corresponded to an observation and each column to a cell type. These different cell types were CD4+ T-cells, CD8+ T-cells, CD27-hi B-cells, CD27-lo B-cells, Basophils, CD14-hi monocytes, CD16-hi monocytes, CD56-hi NK-cells, CD56-mid NK-cells, cDCs, pDCs and other cells. The predictor of interest was the disease status (SLE or healthy control), and a potential covariate was the stimulation condition (T0 or T6). Applying the CODAK test on the full dataset with 106 permutations resulted in a distance correlation 0.6966 and a P-value . When adjusted for the covariate stimulation condition (using CODAK-sk), the distance correlation was 0.7974 and the P-value . However, one needs to be cautious due to the fact that the study contains two repeated measures on each individual subject which leads to the violation of the independence assumption. Separate testing for conditions T0 and T6 still resulted in statistically significant results (P = 0.0006 for both T0 and T6).We also obtained the dcor statistic for every cell type. The results are shown in Figure 7. The top three contributors for both conditions were CD27-lo B-cells, conventional dendritic cells (cDCs) and CD16-hi monocytes. These results are consistent with previously published literature (O’Gorman , 2017; Rodríguez-Bayona ). Pathogenic B-cells have long been implicated in the pathogenesis of SLE, including expansion of CD21-lo B-cells, which are also most often CD27-lo (Dörner ). Monocytes and dendritic cells constitute key elements of a proinflammatory innate immune response and they are significantly influenced by the circulating cytokine environment. In the study group shown, patients demonstrated significantly increased proinflammatory serum cytokines (data not shown), which likely accounts for the expanded CD16-hi monocytes and cDCs subpopulations (Steinbach ).
Fig. 7.
The values of the dcor statistic for testing the difference in cell type abundance when comparing SLE versus healthy controls at T0 and T6
The values of the dcor statistic for testing the difference in cell type abundance when comparing SLE versus healthy controls at T0 and T6Please see Supplementary Section S5 for a second data analysis example containing a continuous predictor (Section 1.1.2).
4 Discussion
We proposed an appropriate kernel to quantify similarities in the cell-type abundance profiles for mass cytometry data using the AD. The proposed kernel has the desirable properties of the AD, and also performs well for both simulated and real datasets. Unlike some existing methods, our framework CODAK can also adjust for covariates, both categorical and continuous.One common issue for testing cell-type abundance for CyTOF is that the number of cell types can often be large and the sample sizes are often small. Many statistical tests, e.g. the tests for the GLMM, are asymptotic tests that do not perform well for such small sample scenarios. In particular, these tests are often anticonservative while resampling versions of these tests can be computationally intensive. We have shown that other tests not specifically requiring large samples in theory can also be anticonservative. Our simulation studies demonstrate that CODAK has far superior type-I error control and competitive power when compared to the state-of-the-art methods. CODAK is also non-parametric in nature, and therefore expected to be more robust compared to existing parametric models against model violations such as violation of distributional assumptions. In one example, we even show that the CODAK-sk method is somewhat robust against violation of the independence assumption and can be used, with caution, for repeated measures data if the repeated measures are expected to be only mildly correlated.Another important feature of CODAK is that it tests for the global null hypothesis of association of the predictors with the full compositional profile of the cell types, and therefore does not suffer from the burden of multiple testing. Such gain is most significant when a number of cell types are associated with the predictor variable. Figure 4 provides an insight into when this approach is most advantageous compared to univariate testing using GLMM or diffcyt-voom. CODAK is also able to test association for continuous predictors while some other methods (e.g. diffcyt) do not provide an easy way to do so.In this article, we have shown the ability of kernel machines to have high power for testing for compositional associations with clinical outcomes, which mirrors its success in other settings (e.g. Rudra ; Wu ). Nevertheless, because the methodology operates at the level of samples and not individual features, one criticism is its interpretability. As suggested in Section 2.3, one can follow up using existing component-level tests and use gatekeeper type methods to adjust for multiple comparisons. We also suggested two kernel-based approaches to rank the individual features in order of their contribution to the overall association. Another finding is that the proposed method is powerful when there are small differences in the abundance of several cell types. By contrast, if only a small subset of cell types has a difference, the simulations show that our method loses power. However, in immunological disease processes, most pathology affects multiple immune cell subsets and different downstream functional effects (Galbraith ; Waugh ).Although we have developed the CODAK framework with CyTOF data in mind, it can potentially be applied more generally for other single-cell platforms such as single-cell RNASeq data. However, it requires more exploration to ensure that the approach performs well for these other data types for which the nature of overdispersion, zero-inflation and data dimensionality might be different compared to CyTOF data. We plan to explore it separately in the future.Finally, CODAK using the AD kernel cannot handle zero proportions. In our experience, it is uncommon to have zeros in the CyTOF cell-type abundance data compared to some other data types such as microbiome data. However, it is not impossible to have a small number of zeros in the compositional cell-type abundance data from CyTOF. In such situations the user can either use (i) the BC kernel (ii) an existing zero-imputation method to replace the zeros before analysis. Since the performance of the BC kernel was not satisfactory, we suggest using the zero-imputation strategy. Several methods of zero-imputation for compositional data exist in the literature [Martín-Fernández , 2012), see Martín-Fernandez and Pawlowsky-Glahn and Buccianti (2011) for a detailed discussion]. The simplest method in the literature is to add a pseudocount, i.e. a small number, often 1, to a zero count (Mandal ; Xia ). We have performed a simulation study to explore the performance of CODAK after adding a pseudocount, and we found that it controls the type-I error and loses only a small amount of power compared to the situation with no zeros (Supplementary Fig. S9). We believe that more complicated kernel-based approaches modeling the probabilities of zeros is beyond the scope of this paper.
5 Conclusion
We have developed a statistical framework based on kernel distance covariance to test association between compositional profiles of cell type abundance with important predictors for mass cytometry data. Our framework can scale up well for high-dimensions and performs well even in small samples. We also proposed methods for covariate adjustment as well as follow-up methods for finding the top cell types contributing to the association. Using extensive simulation studies, the method has been shown to perform well compared to the existing methods under many scenarios. We also demonstrated the performance of the method in real mass cytometry datasets. The approach has further potential to find application for more complex applications such as immunogenomics for multidimensional predictors. With rising applications of CyTOF, our framework provides an important contribution toward the analysis of high-dimensional cell-type abundance data.
Author contributions
P.R. and D.G. conceived the methodology, R.B. and E.H. conducted the experiment(s) and collected the data, P.R. conducted data analysis and simulations. P.R. wrote the manuscript. All authors reviewed the manuscript.
Funding
This work was supported in part by funds from the National Institute of Arthritis and Musculoskeletal and Skin Diseases [K23AR070897, E.H. and R.B.], Grohne-Stepp Endowment from the University of Colorado Cancer Center (D.G.) and the Boettcher Foundation Webb-Waring Biomedical research grant (E.H. and R.B.).Conflict of Interest: none declared.Click here for additional data file.
Authors: Robert V Bruggner; Bernd Bodenmiller; David L Dill; Robert J Tibshirani; Garry P Nolan Journal: Proc Natl Acad Sci U S A Date: 2014-06-16 Impact factor: 11.205
Authors: K Alaine Broadaway; David J Cutler; Richard Duncan; Jacob L Moore; Erin B Ware; Min A Jhun; Lawrence F Bielak; Wei Zhao; Jennifer A Smith; Patricia A Peyser; Sharon L R Kardia; Debashis Ghosh; Michael P Epstein Journal: Am J Hum Genet Date: 2016-03-03 Impact factor: 11.025
Authors: Pratyaydipta Rudra; K Alaine Broadaway; Erin B Ware; Min A Jhun; Lawrence F Bielak; Wei Zhao; Jennifer A Smith; Patricia A Peyser; Sharon L R Kardia; Michael P Epstein; Debashis Ghosh Journal: Genet Epidemiol Date: 2018-03-30 Impact factor: 2.135
Authors: Siddhartha Mandal; Will Van Treuren; Richard A White; Merete Eggesbø; Rob Knight; Shyamal D Peddada Journal: Microb Ecol Health Dis Date: 2015-05-29
Authors: Brian E Vestal; Camille M Moore; Elizabeth Wynn; Laura Saba; Tasha Fingerlin; Katerina Kechris Journal: BMC Bioinformatics Date: 2020-08-28 Impact factor: 3.169