Jinge Yu1, Xiangyu Luo1. 1. Institute of Statistics and Big Data, Renmin University of China, Beijing, 100872, China.
Abstract
MOTIVATION: Spatial transcriptomic techniques can profile gene expressions while retaining the spatial information, thus offering unprecedented opportunities to explore the relationship between gene expression and spatial locations. The spatial relationship may vary across cell types, but there is a lack of statistical methods to identify cell-type-specific spatially variable (SV) genes by simultaneously modeling excess zeros and cell-type proportions. RESULTS: We develop a statistical approach CTSV to detect cell-type-specific SV genes. CTSV directly models spatial raw count data and considers zero-inflation as well as overdispersion using a zero-inflated negative binomial distribution. It then incorporates cell-type proportions and spatial effect functions in the zero-inflated negative binomial regression framework. The R package pscl (Zeileis et al., 2008) is employed to fit the model. For robustness, a Cauchy combination rule is applied to integrate p-values from multiple choices of spatial effect functions. Simulation studies show that CTSV not only outperforms competing methods at the aggregated level but also achieves more power at the cell-type level. By analyzing pancreatic ductal adenocarcinoma spatial transcriptomic data, SV genes identified by CTSV reveal biological insights at the cell-type level. AVAILABILITY: The R package of CTSV is available on https://bioconductor.org/packages/devel/bioc/html/CTSV.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Spatial transcriptomic techniques can profile gene expressions while retaining the spatial information, thus offering unprecedented opportunities to explore the relationship between gene expression and spatial locations. The spatial relationship may vary across cell types, but there is a lack of statistical methods to identify cell-type-specific spatially variable (SV) genes by simultaneously modeling excess zeros and cell-type proportions. RESULTS: We develop a statistical approach CTSV to detect cell-type-specific SV genes. CTSV directly models spatial raw count data and considers zero-inflation as well as overdispersion using a zero-inflated negative binomial distribution. It then incorporates cell-type proportions and spatial effect functions in the zero-inflated negative binomial regression framework. The R package pscl (Zeileis et al., 2008) is employed to fit the model. For robustness, a Cauchy combination rule is applied to integrate p-values from multiple choices of spatial effect functions. Simulation studies show that CTSV not only outperforms competing methods at the aggregated level but also achieves more power at the cell-type level. By analyzing pancreatic ductal adenocarcinoma spatial transcriptomic data, SV genes identified by CTSV reveal biological insights at the cell-type level. AVAILABILITY: The R package of CTSV is available on https://bioconductor.org/packages/devel/bioc/html/CTSV.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
The development of spatial transcriptomic techniques has enabled the measurement of gene expression with accompanied spatial context information (Close ; Larsson ; Zhuang, 2021), providing unprecedented opportunities to investigate the interaction between expression and spatial locations. One crucial challenge in the spatial expression data analysis is to identify genes whose expression levels vary with spatial coordinates in a tissue section, which are termed as spatially variable (SV) genes. In recent years, the task of SV gene detection draws much attention from bioinformaticians, and several statistical methods (Edsgärd ; Hao ; Li ; Sun ; Svensson ; Zhu ) have been proposed to test the dependence of expression on spatial locations. However, the dependence may be confounded by some biological or technical factors. In this article, we aim to mitigate the confounding issues in SV gene identification by accounting for two possible confounding factors—cell-type proportions and excessive zeros.On the one hand, the commonly used spatial transcriptomics (ST) platforms, including ST based on spatially barcoded microarrays (Ståhl ), 10× Genomics Visium (Rao ) and Slide-seq (Rodriques ), profile gene expression from spots that are regularly organized in a grid in a tissue section. Each spot usually consists of dozens of cells, so the observed expression measurements are at the bulk level rather than at single-cell resolution. Since spots in different tissue regions often have different cell-type proportions (Cable ; Elosua-Bayes ), the latent cellular compositions can induce expression variations even though the spatial locations have no impact on the expression, thus confounding the SV gene detection. In fact, the confounding issue by cell-type proportions has been also observed in other types of association studies, e.g. the epigenome-wide association studies (Luo ; Rahmani ; Zheng ). On the other hand, unlike traditional bulk RNA-seq or microarray data, the bulk ST expression still suffers from zero-inflation because the expression signals for a large proportion of genes within each spot are too weak to be captured by ST technologies. Figure 1a shows a bar plot of spot-wise zero proportions in a real bulk ST dataset (Moncada ), and we can observe that more than 80% of spots have at least 70% zeros in the expression. Therefore, it is necessary to account for cell-type proportions and sparsity when modeling bulk ST data.
Fig. 1.
Zero-inflation and overdispersion in the pancreatic ductal adenocarcinoma (PDAC) ST data. (a) Bar plot of spot-wise zero proportions. (b) Scatter plot of genes’ expression variance at logarithmic scale versus expression mean in PDAC data. Each point corresponds to one gene, and the smooth curve corresponds to the points where the mean equals variance
Zero-inflation and overdispersion in the pancreatic ductal adenocarcinoma (PDAC) ST data. (a) Bar plot of spot-wise zero proportions. (b) Scatter plot of genes’ expression variance at logarithmic scale versus expression mean in PDAC data. Each point corresponds to one gene, and the smooth curve corresponds to the points where the mean equals varianceIn bulk ST data, a gene is called SV if it displays an expression pattern that depends on the spatial locations of spots in a tissue section. Considering that a tissue consists of diverse cell types, it naturally brings in the concept of cell-type-specific SV genes: as long as the expressions of one gene are affected by the spatial coordinates of cells of the same type, we then call this gene cell-type-specific SV. However, an SV gene may not be cell-type-specific SV and vice versa. For a simple illustration, in Figure 2a, this gene is SV across the spots, but its expression does not vary within each cell type. On the other hand, in Figure 2b, this gene is cell-type-specific SV in both cell types 1 and 2, but its overall expressions on spots do not change. In practice, it is more likely that a gene is SV at the aggregated level and exhibits SV expressions in one or some cell types but does not in others (e.g. Fig. 2c). In this sense, cell-type-specific SV genes may be different from SV genes, and the direct detection of cell-type-specific SV genes can uncover biological context information. Therefore, there is a pressing need for new statistical methods to capture cell-type-specific SV genes.
Fig. 2.
A simple illustration of SV genes and cell-type-specific SV genes. A big circle represents a spot, a solid/empty triangle means a cell of cell type 1 with expression 1/0 and a solid/empty square is a cell of cell type 2 with expression 1/0. The left panels show the gene expression distribution for each cell type, while the right panels display the aggregated expression pattern, where the number at the lower left of each spot is the aggregated expression value. (a) This gene is SV at the aggregated level (right panel), but its expression keeps unchanged within each cell type (left panel). (b) This gene is not SV at the aggregated level (right panel), but its expression is associated with cell locations for each cell type (left panel). (c) This gene is SV at the aggregated level and is cell-type-1-specific SV, but it is not SV in cell type 2
A simple illustration of SV genes and cell-type-specific SV genes. A big circle represents a spot, a solid/empty triangle means a cell of cell type 1 with expression 1/0 and a solid/empty square is a cell of cell type 2 with expression 1/0. The left panels show the gene expression distribution for each cell type, while the right panels display the aggregated expression pattern, where the number at the lower left of each spot is the aggregated expression value. (a) This gene is SV at the aggregated level (right panel), but its expression keeps unchanged within each cell type (left panel). (b) This gene is not SV at the aggregated level (right panel), but its expression is associated with cell locations for each cell type (left panel). (c) This gene is SV at the aggregated level and is cell-type-1-specific SV, but it is not SV in cell type 2For common SV gene detection, frequentist methods carry out multiple hypothesis testings (non-SV in the null and SV in the alternative) and determine the P-value threshold by controlling the false discovery rate (FDR), and Bayesian methods calculate the posterior probability of being SV for each gene using posterior samples and identify SV genes based on estimated Bayesian FDR. Specifically, to our knowledge, trendsceek (Edsgärd ) and SpatialDE (Svensson ) are the first two statistical methods to achieve that. Trendsceek (Edsgärd ) was built upon the marked point process to test whether the joint probability of expressions on two locations relies on their distance, calling it a mark segregation. It then makes use of four types of mark-segregation summary statistics to compute P-values through permutations. As trendsceek models the probability density, it can capture spatial expression changes both from mean and covariance. In contrast, SpatialDE (Svensson ) only models the spatial covariance structure using zero mean Gaussian process (Williams and Rasmussen, 2006) and fits spatial expression data via a normal distribution, and then compares the result against a null model without spatial effects to calculate P-values. Recently, Hao proposes SOMDE using self-organizing maps to enhance the computational scalability on large-scale data. However, these methods need to first transform raw expression count data to continuous values, and this may lose power in the downstream analysis (Sun ).SPARK (Sun ) is an elegant and powerful statistical method that directly fits spatial raw counts via the Poisson log linear regression model and uses the zero mean Gaussian process to model spatial effects. Hence, it can achieve more power than trendsceek and SpatialDE. It also maintains robustness by considering multiple kernel choices of the Gaussian process and combining multiple P-values through a Cauchy combination rule (Liu ). Nevertheless, a simple Poisson distribution cannot account for excess zeros (Fig. 1a) and overdispersion (Fig. 1b) in the ST expression data. Recently, BOOST-GP (Li ) explicitly models the sparse spatial expression via a zero-inflated negative binomial distribution, where the negative binomial mean is connected to covariates through a log link. Spatial effects are further incorporated via zero mean Gaussian process, and binary indicators are introduced for SV genes. Subsequently, the inference is performed in the Bayesian framework, and the posterior samples of SV gene indicators are used to calculate the posterior inclusion probability. Finally, SV genes are selected based on a controlled estimated Bayesian FDR.Instead of the explicit modeling of zero-inflation in BOOST-GP, Zhu designs a nonparametric approach SPARK-X that does not need to specify the distribution of sparse spatial expression. SPARK-X extends the scalability of SPARK and further improves its robustness on large-scale spatial transcriptomic data. Moreover, as far as we know, currently SPARK-X (Zhu ) is the unique SV gene detection method that provides a way to identify cell-type-specific SV genes. Specifically, when applied to Slide-seq v2 data and HDST data, SPARK-X first uses the cell-type proportion estimates from RCTD (Cable ) to assign each spot to its major cell type and then detects SV genes for spots of the same labeled cell type. Nevertheless, the assignment procedure ignores the influence of minor cell types in each spot, and thus it is more reasonable to directly utilize the cell-type proportion estimates to identify cell-type-specific SV genes.In this article, we develop a simple statistical approach ‘CTSV’ to identify cell-type-specific SV genes accounting for excess zeros. CTSV directly fits the sparse expression raw counts using a zero-inflated negative binomial distribution, models the mean as a weighted average of cell-type-specific spatial expression profiles with weights being the cell-type proportions, and for each cell type connects the spatial expression profile to a function of spatial coordinates. By combining these equations in CTSV, the identification of cell-type-specific SV genes is equivalent to testing whether the function of spatial coordinates is zero for each cell type in a zero-inflated negative binomial regression model. Specifically, since there have been several mature bulk ST deconvolution methods (Cable ; Dong and Yuan, 2021; Elosua-Bayes ), we treat the estimated cell-type proportions as fixed covariates in CTSV. We further model unknown functions to be linear, focal and periodic, respectively, and combine the P-values from the multiple choices to achieve the robustness to unavailable spatial patterns like in SPARK (Sun ). Through simulation studies, CTSV can achieve more power than SPARK-X in detecting cell-type-specific SV genes and also outperforms other methods at the aggregated level. The real-data analysis to pancreatic ductal adenocarcinoma (PDAC) ST data also shows the practical utility of CTSV.The novelty of this work can be reflected in the following two main aspects. First, from the perspective of biology, our article first introduces the concept of cell-type-specific SV genes and highlights its importance and difference from SV genes. Second, the statistical construction procedure of CTSV is novel. CTSV explicitly incorporates the cell-type proportions of spots into a zero-inflated negative binomial distribution and models the spatial effects through the mean vector, whereas existing SV gene detection approaches either do not directly utilize cellular compositions or do not account for excess zeros. It is the subtle construction of CTSV that makes it possible to correctly detect cell-type-specific SV genes from bulk ST data, which will be detailed in the next section.
2 Materials and methods
2.1 The proposed approach CTSV
Suppose there are G genes, n spots and K cell types in the tissue section. Assume that is the bulk ST data matrix, where Y is the observed raw count of gene g in spot i. Let represent the set of coordinates of spots’ centers, and is the two-dimensional coordinate of spot i’s center. To account for the count nature and overdispersion of ST data, we consider the negative binomial distribution with mean and shape parameter ψ for gene g in spot i, and its probability mass function is for any non-negative integer x. In this way, the variance equals and thus is larger than the mean . The scalar c is a size factor to account for different library sizes of spots, and it is computed to be the ratio of spot i’s library size to the median library size across spots, i.e. .In addition to overdispersion, bulk ST data may suffer from zero-inflation—the observed zero proportion is much larger than the expected zero proportion of a negative binomial distribution. Typically there are two kinds of zeros in the data. One is called ‘biological zeros’ resulting from genes that do not express, and the other one is ‘technical zeros’ or ‘dropout zeros’, which means that some genes have relatively low expressions but are not captured. Taking both overdispersion and zero-inflation into consideration, we model the count data Y by a zero-inflated negative binomial distribution,
where π denotes the probability of being a technical/dropout zero for gene g in the spots and δ0 is a Dirac measure with point mass at zero.As one spot may consist of dozens of heterogeneous cells, we model the log scale of λ as a mix of cell-type-specific relative expression levels of gene g in spot i,
w is the cell-type k proportion in spot i, and μ represents the relative mean expression level of gene g for cell type k in spot i. μ depends on the spot i through its location , and the relationship is modeled as follows using a similar formulation from Luo .
where η is the cell-type-k baseline expression level of gene g, the two functions and describe the spatial effects on the mean η, and the coefficients and are of our interest that can reflect whether the location affects the expression of gene g in cell type k. Subsequently, by combining Equations (1)–(3), we arrive at the proposed approach CTSV (Cell-Type-specific SV gene detection),If we integrate the last two equations, CTSV is equivalent toOur next goal is to conduct statistical inference for the coefficients and to test whether they are zero or not for each gene. Specifically, if at least one of the two null hypotheses and is rejected, then we believe that gene g is SV in cell type k.
2.2 Statistical inference
2.2.1 When functions h1 and h2 are known
In Equation (4), if we know the cellular compositions for each spot i as well as the functions h1 and h2, then we can treat them as covariates and thus the inference for CTSV reduces to the inference for a zero-inflated negative binomial regression model (Preisser ), which can be easily conducted by the R package pscl (Zeileis ). However, the cellular compositions of each spot are often unavailable. Fortunately, there have been several deconvolution methods designed for bulk ST data recently, such as RCTD (Cable ), SPOTlight (Elosua-Bayes ) and SpatialDWLS (Dong and Yuan, 2021). Subsequently, we treat the estimates for as fixed covariates and plug them in Equation (4).The parameter estimation in the zero-inflated negative binomial distribution is not trivial. For example, Miao used EM algorithm (Dempster ) to estimate the dropout zero probability π. For each gene, CTSV is essentially a zero-inflated negative binomial regression model, and the likelihood function can be written as follows (gene index g is suppressed for simplicity).
where the parameter set is . We follow the estimation strategy from Zeileis to obtain approximated maximum likelihood estimates for . Specifically, we utilize the conjugate gradient (CG) algorithm (Gilbert and Nocedal, 1992) to minimize the negative logarithmic likelihood () with warm starting values being the iteratively reweighted least squares estimates (Green, 1984).Next, based on the R package pscl (Zeileis ), we can obtain the P-value for the hypothesis vs for gene g in cell type k along the -th coordinate via Wald tests. Notice that as the inference is carried out for each gene independently, the procedure is highly parallelable. We also remark that the usage of pscl is just a computational tool to realize the statistical inference for regression coefficients in CTSV, which does not damage the novelty of CTSV. All the P-values can be organized into a P-value matrix with dimension , where the k-th column corresponds to the P-value vector in cell type k for the s1 coordinate and the -th column to the P-value vector in cell type k for the s2 coordinate. To control the FDR in the multiple hypothesis testings, we convert the P-value matrix to the q-value matrix using the R package qvalue (Storey ; Storey and Tibshirani, 2003). In this way, a q-value threshold α controls the FDR to be not larger than α.Specifically, for each g-th row in the q-value matrix, if there is at least one q-value in this row less than α, we call the corresponding gene g SV at the aggregated level. For each cell type k, if there is at least one q-value in less than α, we then identify the gene g to be cell-type-k-specific SV.
2.2.2 When functions h1 and h2 are unknown
In practice, we often do not know what the type of underlying spatial patterns is in the tissue section for each gene. To deal with possible model misspecification and make the CTSV method more robust, we follow the idea from Sun to choose three types of functions for h1 and h2, which can reflect the linear, focal and periodic spatial expression patterns. Specifically, suppose that and are first transformed to have mean 0 and standard deviation 1. We choose linear functions as and , squared exponential functions and , and periodic functions and . Moreover, for the squared exponential functions, we choose two sets of scale length parameters by (i) letting σ1 and σ2 be the 40% quantile of the absolute values of the transformed and , respectively, denoted by ; and (ii) letting . Similarly, for periodic functions, we set (i) and (ii) . Hence, for each gene g in cell type k along -th coordinate, we obtain five P-values.Accordingly, for gene g in cell type k along -th coordinate, we combine the five P-values following the Cauchy combination rule ACAT (Liu ). We first convert each of the five P-values into a Cauchy statistic , then take an average of them , and transform the average into a single P-value , where C follows the standard Cauchy distribution (Liu ; Pillai and Meng, 2016). In this way, we convert five P-value matrices to one P-value matrix , and then the inference is based on the FDR control as discussed before.
3 Simulation
In this section, we compared the performance of our method with several state-of-the-art SV gene detection methods. We generated the spatial transcriptomic raw count data following Equation (4), where related parameters are set as follows. Suppose there are G = 10 000 genes, n = 600 spots and K = 6 cell types. The cell-type-k baseline expression profile was generated from normal distributions. Specifically, we first independently simulated from for in cell type 1 and then randomly sampled 300 differentially expressed (DE) genes for each cell type k . Next, on the cell-type-k DE genes , we sampled η from independently, where . For expressions on the remaining genes, we set . The explanations for the parameter choices are given in Supplementary Section S1. Moreover, we partitioned the spot region into four regions as displayed in Figure 3a and then sampled cell-type proportions of spot i from Dirichlet distributions. Cell-type proportions of spots in regions from 1 to 4 were independently sampled from and , respectively. For coefficients β, we set 200 SV genes in each cell type, and there were 700 SV genes at the aggregated level. Figure 3b shows the SV gene distribution patterns in each cell type. We further consider the following three simulation settings to specify the spatial effects h1 and h2.
Fig. 3.
Spot regions and the heatmap of cell-type-specific SV gene pattern. (a) Four spot regions with different colors. (b) Heatmap of the SV gene pattern. If one gene in a cell type is SV, then it is colored by black. Only the first 1000 genes are shown for a good visualization because all the remaining genes are not SV
Spot regions and the heatmap of cell-type-specific SV gene pattern. (a) Four spot regions with different colors. (b) Heatmap of the SV gene pattern. If one gene in a cell type is SV, then it is colored by black. Only the first 1000 genes are shown for a good visualization because all the remaining genes are not SVFor the linear spatial pattern as shown in Figure 4a, we chose and . For SV genes, we set and for each cell type. For non-SV genes, was set to be zero.
Fig. 4.
SV genes’ spatial expressions in (a) linear pattern, (b) focal pattern and (c) periodic pattern, where the coordinates are scaled to have mean zero and standard deviation one. (d–f) The ROC curves with false positive rate (FPR) controlled to be <0.05 for CTSV, SPARK-X, SPARK, BOOST-GP, SpatialDE, trendsceek and SOMDE in the three spatial expression patterns. Only the FPR range is shown because in medical and clinical practice we often need to control FPR to be less than a threshold and some range of thresholds may be more important than others (Pencina ). To provide more information, the ROC curves over the whole FPR range (0, 1) are also given in the Supplementary Figure S2
For the focal spatial pattern as shown in Figure 4b, we set and . For SV genes in each cell type, we set and . For non-SV genes, was set to be zero.For the periodic spatial pattern as shown in Figure 4c, we have . For SV genes in each cell type, we set and . For non-SV genes, was set to be zero.SV genes’ spatial expressions in (a) linear pattern, (b) focal pattern and (c) periodic pattern, where the coordinates are scaled to have mean zero and standard deviation one. (d–f) The ROC curves with false positive rate (FPR) controlled to be <0.05 for CTSV, SPARK-X, SPARK, BOOST-GP, SpatialDE, trendsceek and SOMDE in the three spatial expression patterns. Only the FPR range is shown because in medical and clinical practice we often need to control FPR to be less than a threshold and some range of thresholds may be more important than others (Pencina ). To provide more information, the ROC curves over the whole FPR range (0, 1) are also given in the Supplementary Figure S2After obtaining , and , we can calculate and then sample Y from , where the shape parameter is and c = 1. Considering ST data have a large proportion of zeros, we set π () to be 0.6 in each spatial pattern. Therefore, for each gene, the count data were set to be dropout zero with a probability 0.6. Subsequently, we applied the proposed method CTSV to the three types of simulated ST data and compared the performance with trendsceek (Edsgärd ), SpatialDE (Svensson ), SPARK (Sun ), SPARK-X (Zhu ), BOOST-GP (Li ) and SOMDE (Hao ). Their implementation details are given in Supplementary Section S2.When implementing CTSV, we considered the estimate error for the cell-type proportions and sampled from with . In addition, if not available (NA) is returned by the function zeroinfl in R package pscl (Zeileis ), the corresponding P-value is recorded as one. In the argument of function zeroinfl, some commonly used optimization methods can be used, such as BFGS, CG or Nelder-Mead, and we applied CG algorithm for its stability during the optimization procedure. We displayed the histogram for the absolute estimation error in Supplementary Figure S1, showing that the estimation errors concentrate on very small values. Hence, the estimation for the dropout zero probability has slight effects on the detection of cell-type-specific SV genes.The receiver operating characteristic (ROC) curves for identifying SV genes at the aggregated level in the three simulation settings were reported in Figure 4d–f, respectively, where the false positive rate is controlled to be <0.05 for a good visualization of the performance comparison. The partial ROC curves indicate that CTSV uniformly outperformed other methods in SV gene detection at the aggregated level. In each setting, the performance of CTSV was followed by SPARK-X, which also performs well due to its nonparametric nature. SPARK ranked the third for the linear and periodic settings, while SOMDE ranked the third in the focal spatial pattern. SpatialDE, trendsceek and BOOST-GP fail to achieve enough power in all the three simulation settings. Note that trendsceek has four types of statistics, and we only showed the best one. When controlling the FDR <0.01 for each method (i.e. the q-value threshold is 0.01), Table 1 demonstrates the true positive rates (TPRs) and the number of false positives (FP) in the three spatial expression patterns for all the methods. CTSV and SPARK-X gave much higher TPR than other methods, while the FP of CTSV was slightly larger than SPARK-X. We also observed that trendsceek, SpatialDE and SOMDE cannot identify any SV gene with FDR <0.01. Therefore, at the aggregated level, CTSV can provide a high power with controlled FP and FDR owing to its ability to handle excess zeros and account for cell-type proportions.
Table 1.
The comparisons of true positive rate (TPR) and the number of false positives (FP) in SV gene detection at the aggregated level
Pattern
CTSV
SPARK-X
SPARK
BOOST-GP
SpatialDE
SOMDE
Trendsceek
TPR
Linear
0.999
0.907
0.178
0.001
0
0
0
Focal
0.871
0.293
0
0.001
0
0
0
Periodic
0.999
0.819
0
0.003
0
0
0
FP
Linear
33
1
0
5
0
0
0
Focal
19
3
0
5
0
0
0
Periodic
21
3
0
4
0
0
0
The comparisons of true positive rate (TPR) and the number of false positives (FP) in SV gene detection at the aggregated levelRegarding the detection of cell-type-specific SV genes, as SPARK-X is currently the only method that can achieve the function, we compared CTSV and SPARK-X. In SPARK-X (Zhu ), if one spot was dominated by a cell type, which has the maximal proportion in that spot, SPARK-X assigned the spot to the cell type. Subsequently, SPARK-X performed the detection task on spots with the same cell type. Figure 5 displays the heatmaps of () of CTSV and SPARK-X, where P is the P-value of gene g in cell-type k for SPARK-X, and for CTSV. The darker the color, the more significant that the corresponding gene is SV in that cell type. Compared with the underlying truth (Fig. 3b), CTSV obtained more accurate results in identifying cell-type-specific SV genes than SPARK-X. Table 2 indicates that when FDR is controlled to be <0.01, CTSV yielded higher power than SPARK-X for all the cell types in the three simulation settings, but CTSV did not perform very well in the focal spatial expression pattern. The results showed that CTSV is good at identifying cell-type-specific SV genes by directly modeling cell-type proportions rather than transforming them to one-hot code like in SPARK-X, which may lose some information.
Fig. 5.
(a–c) Significance plots of CTSV and (d–f) significance plots of SPARK-X in the three spatial expression patterns for the first 1000 genes. Values in the heatmaps are of the corresponding gene in each cell type. The darker the color, the more likely the corresponding gene is to be SV in that cell type
Table 2.
Cell-type-specific TPR and FP of CTSV and SPARK-X
Pattern
Linear
Focal
Periodic
Methods
CTSV
SPARK-X
CTSV
SPARK-X
CTSV
SPARK-X
TPR
Cell-type 1
1
0.375
0.800
0
1
0
Cell-type 2
0.995
0.095
0.785
0
0.940
0
Cell-type 3
0.980
0
0.605
0
0.970
0
Cell-type 4
0.975
0
0.515
0
0.980
0
Cell-type 5
0.995
0
0.780
0
0.970
0
Cell-type 6
0.995
0
0.905
0
0.995
0
FP
Cell-type 1
35
33
9
0
1
0
Cell-type 2
22
4
9
0
1
0
Cell-type 3
10
0
5
0
7
0
Cell-type 4
11
0
8
0
6
0
Cell-type 5
3
0
9
0
3
0
Cell-type 6
21
0
119
0
5
0
(a–c) Significance plots of CTSV and (d–f) significance plots of SPARK-X in the three spatial expression patterns for the first 1000 genes. Values in the heatmaps are of the corresponding gene in each cell type. The darker the color, the more likely the corresponding gene is to be SV in that cell typeCell-type-specific TPR and FP of CTSV and SPARK-XImperfect deconvolution. To explore the effects of imperfect cell-type proportion estimations on the SV gene discovery, we performed additional experiments under the three spatial patterns. When implementing CTSV, we sampled the cell-type proportion estimates for each spot i, , from ( is the underlying truth) with the concentration parameter , respectively. The lower the α0, the less accurate the deconvolution estimates. Supplementary Figure S3 shows that the imperfect deconvolution may lead to more FP for linear and periodic patterns and decrease power for the focal pattern. Fortunately, when (i.e. the deconvolution is not much bad), the performances of CTSV in most cell types are satisfactory.Model misspecification. We carried out model misspecification experiments where data were generated from a different model. Specifically, we introduced the zero-inflated Poisson log-normal regression model to generate the expression count data, and . In each spatial pattern, we set the standard deviation and other parameters are the same as those in the original simulation study. CTSV and competing approaches were then applied. The ROC curves in Supplementary Figure S4 and TPR/FP comparison in Supplementary Table S1 show that CTSV can outperform SPARK-X and other methods for . However, when τ increases, the performance of SPARK-X begins to be better than CTSV due to a larger gap between the generating distribution and the assumed zero-inflated negative binomial distribution. Fortunately, from the perspective of detecting cell-type-specific SV genes, CTSV can still achieve relatively high accuracy for (Supplementary Tables S2–S4). Therefore, when ST data do not follow the zero-inflated negative binomial distribution, the nonparametric approach SPARK-X may outperform CTSV at the aggregated level. We leave the extension of CTSV to a nonparametric approach as a future work.Missing cell types. The notation K is the number of cell types across all spots in the studied tissue section. We acknowledge that it is possible that each spot i can have its own cell type number K due to the increasing resolution of spatial transcriptomics. Fortunately, our model can be easily adapted to this situation. For example, if there are K = 6 cell types in total and spot i only has three cell types (e.g. 30% cell type 1, 45% cell type 3 and 25% cell type 6), then we can let the cell-type proportion for spot i be , which is also a K = 6 dimensional vector. To evaluate the performance of CTSV in this ‘missing cell type’ case, we randomly chose some spots with the missing cell type number from 1 to 5 (i.e. ), resulting in 21 spots with only 1 cell type, 40 spots with 2 cell types, 61 spots with 3 cell types, 58 spots with 4 cell types, 52 spots with 5 cell types and 368 spots with all the 6 cell types. For the three spatial patterns, Supplementary Figure S5 and Table S5 show that CTSV also achieves good performances in detecting cell-type-specific SV genes.Mixed spatial patterns. We considered three types of mixed spatial patterns: (i) h1 is linear and h2 is periodic, where and ; (ii) h1 is linear and h2 is focal, where and ; and (iii) h1 is periodic and h2 is focal, where and . In each setting, we implemented the CTSV method described in Sections 2 and 3, where h1 and h2 used in CTSV still belong to the same pattern, so these experiments are actually also model misspecification cases. Supplementary Figure S6 and Table S6 illustrate that under mixed spatial patterns, CTSV also outperforms other competing methods and achieves higher TPR with q-value threshold 0.01. Importantly, CTSV can still achieve relatively high accuracy in detecting cell-type-specific SV genes (Supplementary Table S7).Increased dropout zero proportions. To evaluate whether the FP number of CTSV increases with the dropout zero proportion π, we implemented two additional settings where and 0.8. The corresponding results are shown in Supplementary Figure S7 and Tables S8–S10. Compared with other methods, we observe that CTSV is more robust to the dropout zero proportion, and the number of FP does not increase with π.Computational speed. We set the spot size as 600, 1000, 2000, and 5000 to investigate the computational time of CTSV. The average execution time per gene were 9.608 s, 12.187 s, 22.447 s and 45.673 s, respectively, using 4 cores for paralleling. The experiments were implemented on a MacBook Pro computer with Intel Core i5, 4 cores, 8 GB memory and 2.40 GHz.
4 Real-data analysis
We applied CTSV to PDAC ST data (Moncada ), which can be downloaded from Gene Expression Omnibus (Edgar ) with accession code GSE111672, and our analysis focuses on the ST1 data from PDAC Patient A. As there are associated scRNA-seq data with 18 cell types for Patient A, we employed the deconvolution approach SPOTlight (Elosua-Bayes ) to obtain cell-type proportion estimates of each spot. SPOTlight is based on a seeded non-negative matrix factorization regression algorithm. It uses the ST data, scRNA-seq data and a set of marker genes as input, and applies non-negative least squares iteratively to carry out the deconvolution.We remark here that the deconvolution is a nontrivial task and current methods do not perform equally well in different situations, so data analysts need careful considerations in choosing suitable deconvolution tools in their own problems. Here, in the PDAC data analysis, we chose SPOTlight (Elosua-Bayes ) mainly for two reasons. First, SPOTlight was shown to have higher accuracy and sensitivity than other state-of-the-art deconvolution approaches based on synthetic mixture data, and it can be flexibly applied to different technical conditions and protocols. Second, the performance of SPOTlight on the PDAC data has been biologically validated, and the deconvolution results provide many insights into tumor regions (Elosua-Bayes ).We then merged cancer clones A and B into one cell type denoted by ‘cancer cell’, and combined macrophages A and B to one cell type named ‘macrophages’. To alleviate the effects of rare cell types, we calculated the 80th percentile of proportions across spots for each cell type and removed cell types whose 80th percentile is <0.1. After the procedure, six cell types—antigen presenting ductal cells, centroacinar ductal cells, high/hypoxic ductal cells, terminal ductal cells, cancer cells and macrophages—were remained for downstream analysis, and their proportions were adjusted such that they are positive and summed to be one.Subsequently, we filtered out genes that are expressed in <20 spots and kept all spots, resulting in 4070 genes and 428 spots. The justification for using a zero-inflated distribution in CTSV in this dataset is provided in Supplementary Section S3. We afterward applied CTSV, trendsceek (Edsgärd ), SpatialDE (Svensson ), SPARK (Sun ), SPARK-X (Zhu ), SOMDE (Hao ) and BOOST-GP (Li ) to the processed bulk ST data. Because trendsceek and SOMDE did not detect any SV gene in PDAC dataset, we did not display them in the downstream comparisons. The Venn plot (Fig. 6) shows the SV gene overlap among CTSV, SpatialDE, SPARK, SPARK-X and BOOST-GP. When q-value threshold is 0.05, CTSV identified 61 SV genes from 4070 genes at the aggregated level, around half of which were also detected by SpatialDE, SPARK, SPARK-X and BOOST-GP. In contrast, each of the competing methods detected more than 800 SV genes.
Fig. 6.
Venn plot of SV genes detected by CTSV, SPARK, BOOST-GP, SPARK-X and SpatialDE in the PDAC data. The number in the parentheses indicates the total number of SV genes detected by that method
Venn plot of SV genes detected by CTSV, SPARK, BOOST-GP, SPARK-X and SpatialDE in the PDAC data. The number in the parentheses indicates the total number of SV genes detected by that methodFor the identification of cell-type-specific SV genes, we compared the performance between CTSV and SPARK-X. In SPARK-X, each spot was assigned to the major cell type of that spot, and then SPARK-X was applied to spots that belong to the same cell type. Table 3 shows the SV gene number in each cell type for the two methods as well as the number of overlapping SV genes. We also provided the spatial expression patterns of cancer-cell-specific SV genes detected by CTSV for spots with cancer cells being the major cell type component (Fig. 7a). The distribution of cell types is also displayed in Figure 7b. We observe that the expressions show spatial changes in the cancer regions. Specifically, genes like CEL, CPA1 and CLU show relatively low expression levels in the upper right of the cancer region and have relatively high expression values in the lower middle, indicating the cancer-region-specific spatial expression variation of genes identified by CTSV. The spatial expression patterns of 673 cancer-cell-specific SV genes detected by SPARK-X are also given in Supplementary Figure S8, where more than one half of detected SV genes (e.g. AQP8, HMGB1 and NDN) show insignificant spatial variation.
Table 3.
Number of SV genes in each cell type by CTSV and SPARK-X
Cell types
CTSV
SPARK-X
Overlapping genes
Antigen presenting ductal cells
13
0
0
Centroacinar ductal cells
31
0
0
High/hypoxic ductal cells
6
0
0
Terminal ductal cells
6
0
0
Cancer cells
15
673
9
Macrophages
12
0
0
Fig. 7.
(a) The spatial expression patterns of cancer-cell-specific SV genes detected by CTSV in the cancer region of PDAC data. Values are relative expressions, and the calculation details are given in Supplementary Section S4. (b) The distribution plot of six cell types, where each spot corresponds to a pie chart describing the cell type proportions
(a) The spatial expression patterns of cancer-cell-specific SV genes detected by CTSV in the cancer region of PDAC data. Values are relative expressions, and the calculation details are given in Supplementary Section S4. (b) The distribution plot of six cell types, where each spot corresponds to a pie chart describing the cell type proportionsNumber of SV genes in each cell type by CTSV and SPARK-XIn addition, some cell-type-specific SV genes of CTSV provide some connections with tumor or PDAC. Table 4 displays these genes. For example, ARHGDIB in cancer cells, which was not identified by SPARK-X, encodes the protein RhoGDI2 that functions as a metastasis suppressor in human cancer (Gildea ) and plays an important role in tumor dormancy regulation (Said ). ISG15 found in antigen presenting ductal cells is associated with the reinforcement of cancer stem cells’ self-renewal, invasive capacity and tumorigenic potential in PDAC (Sainz ). In terminal ductal cells, JADE1 may contribute to the development of pancreatic cancer (Liu ). CLPS was detected as an SV gene in more than one cell type, and the pancreatic lipase requires the colipase protein encoded by CLPS for efficient dietary lipid hydrolysis (Lowe, 1997; Van Tilbeurgh ). Thus, the results by CTSV provide some clues for clarifying the underlying tumor mechanisms, which requires further validations by biological experiments.
Table 4.
Cell-type-specific SV genes detected by CTSV
Cell types
SV genes
Antigen presenting ductal cells
AC092798.1, AL139039.2, CEL, CERS5
CLPS, CTRB1, CTRB2
DUOXA2, FP671120.4, GAPDH
GP2, ISG15, MED16
Centroacinar ductal cells
AC009078.2, AC090114.1, C3, C4A
CD63, CD74, CEL, CELA3A
CELA3B, CLPS, COL6A2, CPA1
CPA2, CPB1, CRP, CTRB1
CTRB2, CTRC, DUOXA2, ELF3
FUT11, GP2, HEIH, IFI6
IGHGP, KRT8, LCN2, MMP1
MMP14, MUC5B, NR4A1
High/hypoxic ductal cells
AL139039.2, APBB1, ATXN2L
FYCO1, GALNT14, MMP23A
Terminal ductal cells
AC022558.1, AL139039.2, CLPS
COLGALT2, JADE1, MCRIP2
Cancer cells
AC022558.1, AL139039.2, ARHGDIB, C3
CEL, CHMP6, CLPS, CLU
CPA1, CPB1, CTRB1, CTRB2
ELN, GALNT14, LINC00685
Macrophages
AC073896.4, CBLC, CDKN1A, COLGALT2
DES, ELF3, FGFRL1, FTL
GALNT14, IGFBP4, LNPEP, NSDHL
Cell-type-specific SV genes detected by CTSV
5 Conclusion
In this article, we developed a cell-type-specific SV gene detection method (CTSV) for bulk ST data. CTSV directly models raw count data through a zero-inflated negative binomial distribution, incorporates cell-type proportions and relies on the R package pscl (Zeileis ) to fit the model. To capture different types of spatial patterns, five spatial effect functions are used, and then CTSV applied the Cauchy combination rule (Liu ) to obtain P-values for robustness.In simulation studies, CTSV was not only shown to be the most powerful approach at the aggregated level in the three spatial expression settings, but it also outperformed SPARK-X in terms of cell-type-specific SV gene detection, perhaps due to the direct consideration of cell-type proportions. In the analysis for PDAC data, CTSV also identified reasonable cell-type-specific SV genes that are related to meaningful biological functions.In fact, the spatial information can be incorporated into the Gaussian process in two ways—the spatial effect on the mean vector or the spatial dependency induced by the covariance matrix. Previous methods including SpatialDE and SPARK used the covariance matrix modeling, while CTSV chose the mean to reflect spatial effects for two reasons. First, from the perspective of statistics, it is easier to test the regression coefficients ( with null hypothesis ) in the mean function than the scale parameter ( with null hypothesis ) in the covariance matrix (e.g. SPARK), as the latter is a hypothesis testing at the parameter space boundary and thus needs more complicated statistical techniques. Second, from the perspective of biology, by modeling two axes and separately, we have the opportunity to distinguish which axis may affect the gene expression based on and . For example, it is possible that the expression changes only with and keeps invariant with (see Supplementary Fig. S9). We acknowledge that CTSV can be further equipped with the spatial dependency via the covariance matrix, but this makes the statistical inference difficult and computationally inefficient. Hence, we leave it as a future work.In addition, the performance of SPARK-X and CTSV are close in the simulation but are very different on the real data for the following reasons. On the one hand, the difference between simulation and real data may be due to the different zero-inflation rate. In PDAC ST data, the gene-wise zero proportions have the interquartile range [0.8014, 0.9322], while in simulation the interquartile range of gene-wise zero proportions is [0.5867, 0.6150] with the dropout probability . When we increase π from 0.6 to 0.7 and 0.8, the gap between SPARK-X and CTSV becomes larger in the ROC curves (Supplementary Fig. S7). On the other hand, in the simulation setting, we let the spatial pattern be linear, focal and periodic, respectively, while in the real-data analysis the spatial pattern of SV gene expressions can be more complex, e.g. the combination of two or more patterns. Therefore, these factors may make the statistical performances of CTSV and SPARK-X different between simulation and real-data application.Several extensions are worth exploring in the future. First, for robustness, we choose five simple spatial effect functions for h1 and h2, and it is better to utilize nonparametric statistical methods to directly fit the functions, such as splines or wavelets. Second, it is more helpful to incorporate prior knowledge of the tissue images (Hu ). Third, when it comes to single-cell spatial expression data, we can also apply CTSV by setting the proportion of the cell type to which this cell belongs as one and the proportions of other cell types as zero.Moreover, integrating multiple datasets can borrow strengths across different platforms to increase statistical power. However, due to the different protocols, it may suffer from platform effects. In principle, CTSV may incorporate the platform effects γ in platform b through the following modeling.
where Y is the read count of gene g for spot i in platform b, platform b has an additive effect γ on the gene expression and on the platform one is fixed at zero for identifiability. Moreover, the platform may also affect the variance or the dropout zero proportion π, which makes the statistical inference for CTSV more complex. Therefore, our future direction is to equip CTSV with the ability to address platform effects.Click here for additional data file.
Authors: Patrik L Ståhl; Fredrik Salmén; Sanja Vickovic; Anna Lundmark; José Fernández Navarro; Jens Magnusson; Stefania Giacomello; Michaela Asp; Jakub O Westholm; Mikael Huss; Annelie Mollbrink; Sten Linnarsson; Simone Codeluppi; Åke Borg; Fredrik Pontén; Paul Igor Costea; Pelin Sahlén; Jan Mulder; Olaf Bergmann; Joakim Lundeberg; Jonas Frisén Journal: Science Date: 2016-07-01 Impact factor: 47.728
Authors: Reuben Moncada; Dalia Barkley; Florian Wagner; Marta Chiodin; Joseph C Devlin; Maayan Baron; Cristina H Hajdu; Diane M Simeone; Itai Yanai Journal: Nat Biotechnol Date: 2020-01-13 Impact factor: 54.908