Literature DB >> 33720414

Modeling dynamic correlation in zero-inflated bivariate count data with applications to single-cell RNA sequencing data.

Abstract

Interactions between biological molecules in a cell are tightly coordinated and often highly dynamic. As a result of these varying signaling activities, changes in gene coexpression patterns could often be observed. The advancements in next-generation sequencing technologies bring new statistical challenges for studying these dynamic changes of gene coexpression. In recent years, methods have been developed to examine genomic information from individual cells. Single-cell RNA sequencing (scRNA-seq) data are count-based, and often exhibit characteristics such as overdispersion and zero inflation. To explore the dynamic dependence structure in scRNA-seq data and other zero-inflated count data, new approaches are needed. In this paper, we consider overdispersion and zero inflation in count outcomes and propose a ZEro-inflated negative binomial dynamic COrrelation model (ZENCO). The observed count data are modeled as a mixture of two components: success amplifications and dropout events in ZENCO. A latent variable is incorporated into ZENCO to model the covariate-dependent correlation structure. We conduct simulation studies to evaluate the performance of our proposed method and to compare it with existing approaches. We also illustrate the implementation of our proposed approach using scRNA-seq data from a study of minimal residual disease in melanoma.

Entities: Chemical

Keywords: correlated count data; covariate-dependent correlation; dynamic coexpression; liquid association; single-cell RNA sequencing; zero inflation

Mesh：

Year: 2021 PMID： 33720414 PMCID： PMC8477913 DOI： 10.1111/biom.13457

Source DB: PubMed Journal: Biometrics ISSN： 0006-341X Impact factor: 1.701

INTRODUCTION

Interactions between biological molecules in a cell are tightly coordinated and often highly dynamic (Luscombe et al., 2004; de Lichtenberg et al., 2005). They can change flexibly under different cellular conditions or in response to various external stimulants and signals. As a result of these varying signaling activities, changes in gene coexpression patterns can often be observed in these situations (Li, 2002; Li and Yuan, 2004; de la Fuente, 2010). Studying these dynamic changes in gene coexpression could reveal these intricate underlying gene regulatory mechanisms. Although it is a challenging task to unravel the complex genetic interactions in a biological system, several statistical approaches have been introduced to describe the coexpression between a pair of genes such as Pearson correlation or rank correlation, F‐statistic (Lai et al., 2004), mutual information (Faith et al., 2007), entropy‐based approaches (Ho et al., 2007), Gaussian graphical models (Ma et al., 2007), and Bayesian network (Ho et al., 2014). However, these approaches do not account for the fact that genetic circuits can be turned on or off and genes may participate in different regulatory processes under different cellular conditions. One statistical measure that can capture these dynamic gene correlation changes was proposed by Li (2002). This measure, named dynamic correlation in this paper, quantifies the relationship where the coexpression between two genes is modulated by a third “coordinator” gene. Li (2002) examined these dynamic correlation changes (referred to as liquid association in his paper) in canonical pathways using microarray gene expression data from a model organism, Saccharomyces cerevisiae. For a typical genomic study, a pathway‐based or a genome‐wide screening strategy can be implemented as presented in several studies to effectively identify potential dynamic correlation changes (Dawson and Kendziorski, 2012; Gunderson and Ho, 2014; Wang et al., 2017; Yu, 2018; Kinzy et al., 2019). Li's study and other studies since then have evidently established its biological validity and popularized it to be a useful tool for analyzing genomic data (Li, 2002; Li et al., 2004; Ho et al., 2007; Zhang et al., 2007; Ho et al., 2011; Wang et al., 2013; Khayer et al., 2017; Wang et al., 2017; Xu et al., 2017; Ai et al., 2019; Kong and Yu, 2019; Wen et al., 2020). However, when it comes to count data such as RNA sequencing reads, these existing Gaussian‐based approaches may not fit the data properly. RNA sequencing (RNA‐seq) data are often presented as a count matrix with nonnegative counts as the number of reads observed. Count‐based models such as the Poisson distribution and the negative binomial distribution are widely used to analyze the RNA‐seq data. Karlis and Meligkotsidou (2005) proposed a multivariate Poisson model with covariance structure. Due to both biological and technical variability, RNA‐seq count data are often overdispersed. For overdispersed data, the variance is larger than the mean, which is a violation of the assumption of the Poisson distribution (mean and variance are equal). To handle overdispersion, Solis‐Trapala and Farewell (2005) used a multivariate Poisson‐Gamma mixture model. Robinson et al. (2010) modeled the data using the negative binomial distribution and treated the Poisson distribution as a special case of the negative binomial distribution. Ma et al. (2020) proposed flexible models for modeling bivariate correlated count data. In recent years, the rapid development of next‐generation sequencing technologies has made it possible to examine the sequence information from individual cells. Single‐cell RNA sequencing (scRNA‐seq) analyzes the expression of RNAs from individual cells, whereas traditional RNA‐seq can only analyze the RNAs from mixed cell populations (Bacher and Kendziorski, 2016; Hwang et al., 2018). scRNA‐seq gives insight into individual cells' function and behavior at various stages and in various cell types, and hence, can provide a high‐resolution view of dynamic coexpression regulation in a biological system. However, the analysis of scRNA‐seq data is complicated by high levels of technical noise and intrinsic biological variability (Kharchenko et al., 2014). Due to the low amounts of mRNA within individual cells, the counts of single‐cell gene expression data contain a large number of zero expression measurements. To avoid stochastic zero counts, Lun et al. (2016) developed a normalization method based on pooling expression values. Pierson and Yau (2015) developed a dimensionality‐reduction method considering the dropout characteristics to improves modeling accuracy. Miao et al. (2018) used a zero‐inflated negative binomial model to estimate the proportion of real and dropout zeros. Kharchenko et al. (2014) modeled the measurement of each cell as a mixture of two components: one for transcripts that are successfully detected and the other for dropout events during amplification. Motivated by the dynamic correlation studies in microarray data, in this article, we propose the ZEro‐inflated negative binomial dynamic COrrelation (ZENCO) model. We account for overdispersion and zero inflation in count data by considering a mixture model of conditional bivariate negative binomial regressions and zero counts. A latent variable is incorporated into ZENCO to model the covariate‐dependent correlation structure. We demonstrate the implementation of ZENCO model using the scRNA‐seq data of melanoma cells from Gene Expression Omnibus (GSE116237) and study the difference of dynamic correlations between various phases during combined BRAF and MEK (BRAF/MEK) treatment. The remainder of the article is arranged as follows. In Section 2, the detail of the proposed model is introduced. The simulation studies and comparisons are conducted in Section 3. In Section 4, the analysis of scRNA‐seq data generated from melanoma tumor cells is presented. Section 5 concludes this article with some discussion.

METHOD

The ZENCO model

For modeling dynamic coexpression changes, we use X 1, X 2, and X 3 to denote the count‐based expression levels for three genes. Let represent the gene expression level of the ith gene () in the jth cells (), and represents the gene expression level for the ith gene. In our proposed framework, the marginal distribution of is modeled as a mixture of dropout component and negative binomial component (nondropout events). The distribution of is given by where is the distribution with a point mass at zero; is the dropout rate of ; is the mean of the negative binomial component of ; and is the dispersion parameter of the negative binomial component. The variance of the negative binomial component of is . As goes to 0, . The dropout rate of a given gene, , is modeled as a function of its mean. The dropout rates are study‐specific and can be estimated for a given scRNA‐seq data set. Based on the melanoma data considered in the study, we model the dropout rate using a logistic function: , where μ is the mean of a given gene and b 0, b 1 can be estimated using the expression levels of all available genes in the data (Pierson and Yau, 2015). Furthermore, we use the indicator to describe whether dropout happens or not. If , then the ith gene in the jth cell is successfully amplified (nondropout event). If , then dropout happens. According to the combinations of different values of and , there are four different situations for X 1 and X 2. Their marginal densities can be written as: When , the joint distribution of X 1 and X 2 involves a correlation parameter that depends on the expression level of . In other words, the correlation between and could change according to the level of when all three genes (, , and ) are successfully amplified in the jth cell. If =1 or , and are independent, because at least one measurement of and comes from the dropout component. We model the dependency between X 1 and X 2 and construct our conditional bivariate negative binomial model through a Poisson–Gamma mixture distribution. For and , let A negative binomial distribution of can be generated by integrating over in (3). In this Poisson–Gamma mixture setting, can be considered as the cell‐specific random effect. To introduce the conditional correlation between and given , we utilize a latent variable Z and model the conditional correlation implicitly through the cell‐specific random effect (). Let be a bivariate normal variable that The correlation, , of is specified as is the Fisher's Z‐transformation for the correlation that ensures that the correlation is within (−1, 1). Now, we incorporate this latent variable into the cell‐specific random component () in the Poisson–Gamma mixture in (3) to construct a conditional bivariate negative binomial model of with marginal distribution and and the correlation of depends on . Specifically, for and , let where is the cumulative distribution function of a distribution with and is the cumulative distribution function of a standard normal distribution. maps each point in the interval (0,1) to distribution. Hence, the distribution of is . The distribution of is then a Poisson–Gamma mixture distribution, which follows the negative binomial density . In the model described above, in order to determine the existence of the dynamic coexpression change of X 1, X 2 given X 3, the main parameter of interest is τ1 in (5). If τ1=0, then the correlation between X 1 and X 2 does not depend on X 3 and vice versa. In the ZENCO model, we develop a statistical inference procedure via a Bayesian perspective, because it offers a relatively straightforward way to compute through Markov chain Monte Carlo (MCMC) sampling. In addition, the posterior distributions of the parameters can be obtained with a set of standard conjugate priors. Under the hypotheses: the statistical power of the proposed ZENCO approach can be calculated as follows. First, we obtained the posterior sampling distribution of τ1, and then calculated the 95% equal tail credible interval. Power can be evaluated as the proportion of times when zero is not covered by the 95% credible intervals. We now describe the likelihood function and the MCMC scheme. Let vector be the notation of all parameters (μ1, μ2, μ3, ϕ1, ϕ2, ϕ3, τ0, τ1) in the model. And let be the prior joint distribution of , the likelihood function is given by where and are from observed data and . and are independent given . Hence, the posterior joint distribution of μ1, μ2, μ3, ϕ1, ϕ2, ϕ3, τ0, τ1 given the observations is proportional to where is the distribution of for : The dropout rate is study‐specific and can be determined using all genes measured in the study as a function of described previously. And is the probability density function of a bivariate normal distribution with a covariance matrix structure: For any given , can be derived as described in (4) and (5). Finally, is formulated as in (1). For a given gene triplet, the parameter estimation can be carried out using the MCMC algorithm provided in JAGS (Plummer, 2003). We use the normal distribution with mean 0 and variance 4/N as the priors of τ0 and τ1, where N is the sample size. This is because the approximate variance of Fisher's Z‐transformation is . The priors for μ1, μ2, and μ3 are standard log‐normal distributions. The noninformative priors for the dispersion parameters , , and are the Gamma distribution with mean 100 and relatively large variance 10,000. The sampling scheme during each MCMC iteration is as follows. For , , we sample from ∝ and sample from ∝ , where is the probability density function of Then we sample τ0 from and sample τ1 from where In addition, can be sampled from where .

Search strategies

There are several ways to implement the ZENCO approach in a genomic study. We describe a few here: (i) for a given pair of genes (X 1, X 2), screen the whole genome to identify the coordinator genes (X 3) that regulate the correlation between X 1 and X 2, or (ii) for a given X 3, screen‐related pathways or the whole genome to identify pairs of genes that are modulated by X 3 (m choose 2 gene pairs; m is the total number of genes considered), or (iii) if no prior information about X 3 or (X 1, X 2) is available, screen relevant genetic pathways, or screen the whole genome to identify potential gene triplets that exhibit dynamic correlation changes (m choose three gene triplets). In the experimental data analysis described in Section 4, we demonstrated the second (ii) approach. When the number of relevant genes under consideration is large (for example, ≈ 20,000), a prescreening step is usually beneficial before implementing ZENCO. For example, the algorithm proposed by Gunderson and Ho (2014) or the screening statistic (ζ) introduced in Yu (2018) or filtering out gene with constant expression has been used effectively in the literature.

SIMULATION

To evaluate the performance of our proposed ZENCO model and compare it to existing benchmark approaches, we report results from five simulation scenarios below.

Scenario 1: Simulating data from ZENCO

In this first simulation, we demonstrate generating data from the ZENCO model. The simulated data contain count‐based expression level of three genes: X 1, X 2, and X 3. In our model, the correlations of X 1 and X 2 are modulated by the level of X 3. This simulation was conducted as follows. First, we simulated a set of from a univariate negative binomial distribution with mean μ3 and size ϕ3 and then randomly selected a subset as the dropouts and replaced these with zero. After the simulation of , we calculated correlation coefficient for each . Note that for dropouts in , we used μ3 instead of to calculate , because the values of those dropouts have nothing to do with the regulatory mechanism of X 3. Then, we generated latent variables such that and simulated and using as described in (6). The dependence structure of and is implicitly modeled via . Finally, just like the simulation of , we randomly replaced values of and for dropout events. Using the simulation approach described above, we generated 105 observations from the ZENCO distribution and plotted a panel of conditional distributions of X 1 and X 2 given various levels of X 3 in Figure 1. In these figures, we observed that when X 3 is not zero, ρ increases with X 3. When X 3 is zero, the correlations of X 1 and X 2 are small and show reduced dependency with respect to X 3. This is due to the zero value observation of X 3 being a mixture of true zero and dropout. In other words, some zero values of X 3 come from the negative binomial distribution, others come from dropout events.

FIGURE 1

Profile plots of with varying X 3 (, , , and )

Scenario 2: Comparisons to existing approaches

To evaluate the performance of our proposed ZENCO model, we performed power analysis and compare ZENCO to three other existing approaches. For testing the existence of dynamic coexpression changes, our hypotheses are set up as: First, we compared ZENCO to a bivariate negative binomial regression without considering the zero‐inflated components. Similarly to ZENCO, the statistical power of this method can be calculated as the percentage of times that the posterior 95% credible intervals of τ1 do not cover zero. The ZENCO model and the model without considering the zero‐inflated components were both carried out using the MCMC algorithm with 20,000 iterations, and 10,000 burn‐ins. Second, we compared ZENCO to the existing benchmark approach introduced by Li (2002). This existing approach was later applied to scRNA‐seq data by Yu (2018). This test statistic according to the three‐product‐moment measure is written as: , where , , are the standardized X 1, X 2, X 3 with mean 0, variance 1, and is the three‐product‐moment estimator for the dynamic correlation. , the standard error of , can be estimated via bootstrap. can be used to test whether the correlation of X 1, X 2 depends on X 3, that is, (Li, 2002; Ho et al., 2011). The distribution of under the null hypothesis and associated p‐value can be obtained using a permutation approach. The third comparison is to fit the negative binomial count data with the conditional normal model (CNM‐Full) (Ho et al., 2011). Assuming that data are from the conditional bivariate normal distribution instead of the conditional bivariate negative binomial distribution, the test statistic of this method can be estimated using a generalized estimating equation‐based procedure (Yan and Fine, 2004) and a p‐value associated with the test statistic can be obtained. The powers of these two methods ( and CNM‐Full) can be calculated by counting the percentage of times when p‐values associated with τ1 are less than .05. We simulated 1000 observations from ZENCO model by fixing , , and , and then varied τ1 values and performed power analyses. The simulated values of are based on the estimates obtained from the real data analysis. Figure 2 shows the power curves of the four methods. We observed that our proposed ZENCO method outperforms the other three methods. In addition, fitting the negative binomial count‐based data using Gaussian‐based models reduces statistical power drastically. This is because ZENCO accounts for both zero inflation and overdispersion of the data, and hence achieves better power to detect dynamic dependence structure.

FIGURE 2

Power curves comparing various methods. Both TLA and CNM‐Full approaches are Gaussian‐based models

Scenario 3: Estimation efficiency

In this simulation scenario, we evaluated the estimation efficiency of the ZENCO model and reported mean squared errors (MSE), mean bias errors (MBE), and 95% empirical coverage probabilities under various settings. Three sets of simulation studies were done with sample sizes 200, 500, and 1000. For each simulation study, we generated 1000 data sets. We used the parameter estimated values obtained from the real data analysis in Section 4 and set the true values of the parameters as follows: , , , and . The true values of the parameters associated with dropout rate were similar to the values obtained based on the real data: and (dropout rates for X 1 and X 2 are both 0.44). The empirical 95% coverage probabilities from the posterior distributions and the length of credible intervals are shown in Table 1. In Table 1, we also presented the parameter estimates using a negative binomial model without zero inflation. The empirical 95% coverage probability is calculated as the percentage of times when the 95% credible intervals covering the true parameter value based on 1000 MCMC simulations. The simulation results shown in Table 1 suggest that ZENCO model provides a much better 95% coverage probability than a negative binomial regression method model without zero inflation.

TABLE 1

Coverage probability of 95% credible intervals (CIs) and interval lengths based on 1000 MCMC simulations (, )

		Without zero inflation		With zero inflation
	Parameter	Coverage probability	CI length	Coverage probability	CI length
N=200	τ₀	1.000	0.237	1.000	0.246
	τ₁	0.154	0.041	0.957	0.095
N=500	τ₀	1.000	0.223	1.000	0.244
	τ₁	0.006	0.022	0.961	0.059
N=1000	τ₀	0.957	0.205	1.000	0.242
	τ₁	0.000	0.015	0.954	0.040

Coverage probability of 95% credible intervals (CIs) and interval lengths based on 1000 MCMC simulations (, ) MSEs and MBEs are shown in Table 2. The MBE of a given parameter β is calculated as ; N is the number of simulation iterations (N = 1000). Based on the simulation results in Table 2, ZENCO model has smaller MSEs and MBEs comparing with the nonzero‐inflated negative binomial regression method.

TABLE 2

Mean square errors (MSEs) and mean bias errors (MBEs) based on 1000 MCMC simulations (, )

		Without zero inflation		With zero inflation
	Parameter	MSE	MBE	MSE	MBE
N=200	τ₀	0.001	0.005	0.000	−0.008
	τ₁	0.002	−0.039	0.001	−0.006
N=500	τ₀	0.002	0.024	0.000	−0.009
	τ₁	0.002	−0.040	0.000	−0.001
N=1000	τ₀	0.004	0.048	0.000	−0.009
	τ₁	0.002	−0.041	0.000	0.000

Mean square errors (MSEs) and mean bias errors (MBEs) based on 1000 MCMC simulations (, )

Scenario 4: Robustness

To assess the robustness of the ZENCO method under model misspecification, we conducted three sets of simulations where the data are generated via a negative binomial model without zero inflation. The three sets of simulation studies were performed with sample sizes 200, 500, and 1000, and each with 1000 simulation iterations. The true values of parameters were set as , , , and . We analyzed the simulated data sets using a negative binomial regression method without zero inflation and the ZENCO method. The empirical 95% coverage probabilities from posterior distributions and the length of credible intervals using the above two models are shown in Table S.1; the MSEs and MBEs are shown in Table S.2. The simulation results shown in Table S.1 and Table S.2 suggest that our proposed estimation procedure in ZENCO is fairly robust even when the data are generated from a nonzero‐inflated negative binomial setting.

Scenario 5: A multiple‐gene setting

In this simulation scenario, we turn our attention to a multiple‐gene setting. Our goal here is to demonstrate that our proposed approach could capture dependencies among multiple genes through multiple pairwise searches. We set and , which is similar to the values obtained based on the real data and then simulated five genes (10 gene pair combinations) with , , , , , , , , , . The true values of the 10 range from 0.005 to 0.05, whereas the true value of τ0 was set as 0. The empirical 95% coverage probabilities and MBEs of 10 are shown in Table S.3. The results indicate that our method demonstrated desirable performance under a multiple‐gene setting.

EXPERIMENTAL DATA ANALYSIS

We used the proposed ZENCO model to analyze the melanoma data set described in Rambow et al. (2018). The scRNA‐seq data were obtained from Gene Expression Omnibus (GEO accession number: GSE116237). The data set consists of 57,445 genes and 674 melanoma cells. To study minimal residual disease (MRD) as well as relapse during melanoma treatment, Rambow et al. (2018) performed scRNA‐seq using malignant cells from BRAF‐mutant patient‐derived xenograft melanoma cohorts treated with BRAF/MEK inhibitor (dabrafenib/ trametinib). During the course of continuous treatment with BRAF/MEK inhibitor, the transition of tumor cells can be categorized into three phases: phase 1 is in the early stage when all treated lesions rapidly shrunk upon initial treatment (BRAF‐inhibitor sensitive); phase 2 is the second stage when drug‐tolerant tumor cells remain viable upon continuous treatment (MRD); in phase 3, relapse is observed and tumor cells exhibit adaptive resistance to continuous BRAF inhibition treatment (BRAF‐inhibitor resistance). Among the 674 melanoma cells in the data set, there are 155 phase 1 cells, 199 phase 2 cells, and 148 phase 3 cells. More details can be found in Rambow et al. (2018). To gain insight into transcriptional switches of genetic circuits in tumor cells during the course of BRAF‐inhibitor treatment, we set out to identify gene pairs that interact with BRAF differently between BRAF‐inhibitor sensitive cells (phase 1) and BRAF‐inhibitor resistance cells (phase 3). Hence, in this analysis, we chose BRAF as X 3 and conducted the pairwise analysis for genes in the melanoma pathway described in the KEGG database (Kanehisa and Goto, 2000). According to the melanoma pathway in KEGG database, 72 genes were identified as melanoma‐associated genes. The data were first preprocessed by the procedures described in McCarthy et al. (2017). After removing low expressed genes (maximum count across all cells less than 5) and genes with more than 70% zeros in either phase 1 cells or phase 3 cells, 28 genes were selected for further analysis. The study‐specific parameters, b 0, b 1, associated with dropout rates can be estimated using the logistic function . In the logistic function, we used the sample mean to estimate μ. After calculating the dropout rate as the proportion of cells with zero counts, a nonlinear least‐squares approach was then applied to calculate b 0 and b 1. We implemented ZENCO analyses for 351 gene pair combinations in phase 1 cells and phase 3 cells and obtained the estimates of τ1. To identify the gene pairs that interact with BRAF differently, we chose gene pairs that are in both phase 1 and phase 3 cells and calculated the differences of τ1 estimates between the two phases. The top 30 gene pairs with the largest differences of τ1 between phase 3 and phase 1 are shown in Table 3.

TABLE 3

Top table of dynamic correlations differences. is the difference between τ1 estimates in phase 3 (P3) and phase 1 (P1)

#	Gene1	Gene2	τ1(P1)	τ1(P3)	Δτ1
1	PDGFC	FGFR1	0.045 (0.021, 0.068)	−0.003 (−0.010, 0.005)	−0.047 (−0.072,−0.023)
2	AKT1	BAX	0.040 (0.008, 0.071)	−0.003 (−0.014, 0.008)	−0.043 (−0.075,−0.010)
3	AKT1	PIK3R1	−0.016 (−0.035, 0.004)	0.024 (0.009, 0.038)	0.040 (0.015, 0.062)
4	PDGFC	MAP2K2	0.016 (−0.002, 0.032)	−0.023 (−0.036,−0.006)	−0.039 (−0.059,−0.013)
5	IGF1R	FGFR1	−0.024 (−0.048, 0.000)	0.007 (0.000, 0.014)	0.032 (0.006, 0.056)
6	MDM2	CCND1	0.021 (0.007, 0.031)	−0.011 (−0.018,−0.004)	−0.031 (−0.044,−0.017)
7	AKT1	ARAF	−0.025 (−0.047, 0.002)	0.007 (−0.007, 0.018)	0.031 (0.002, 0.056)
8	AKT1	MAP2K1	0.025 (0.004, 0.057)	−0.006 (−0.017, 0.009)	−0.030 (−0.063,−0.006)
9	AKT1	MAPK1	−0.003 (−0.012, 0.006)	0.026 (0.007, 0.055)	0.029 (0.007, 0.058)
10	KRAS	PDGFC	0.012 (−0.005, 0.024)	−0.017 (−0.042, 0.005)	−0.029 (−0.057,−0.002)
11	IGF1R	MAP2K2	0.025 (0.002, 0.056)	−0.004 (−0.011, 0.006)	−0.028 (−0.060,−0.004)
12	PTEN	PDGFC	−0.022 (−0.036,−0.004)	0.007 (−0.003, 0.014)	0.028 (0.008, 0.044)
13	PTEN	PIK3R1	0.031 (0.007, 0.050)	0.005 (−0.006, 0.014)	−0.027 (−0.048,−0.002)
14	BAX	POLK	0.025 (0.006, 0.048)	0.000 (−0.012, 0.010)	−0.026 (−0.051,−0.003)
15	KRAS	NRAS	0.017 (−0.003, 0.034)	−0.008 (−0.015, 0.002)	−0.024 (−0.043,−0.003)
16	ARAF	RB1	0.020 (0.008, 0.032)	−0.004 (−0.009, 0.002)	−0.024 (−0.037,−0.011)
17	AKT1	RAF1	−0.016 (−0.033,−0.003)	0.007 (−0.004, 0.017)	0.023 (0.006, 0.042)
18	NRAS	MAPK1	0.017 (0.002, 0.029)	−0.005 (−0.013, 0.006)	−0.021 (−0.037,−0.004)
19	PIK3R1	MDM2	0.020 (0.004, 0.035)	−0.001 (−0.010, 0.008)	−0.021 (−0.038,−0.002)
20	IGF1R	TP53	−0.016 (−0.034, 0.002)	0.005 (−0.003, 0.011)	0.020 (0.002, 0.039)
21	BAK1	POLK	−0.018 (−0.030,−0.006)	0.002 (−0.006, 0.010)	0.020 (0.006, 0.034)
22	AKT3	MAP2K2	0.016 (0.005, 0.025)	−0.003 (−0.011, 0.007)	−0.018 (−0.030,−0.006)
23	PTEN	KRAS	−0.005 (−0.016, 0.011)	0.012 (0.003, 0.020)	0.017 (0.000, 0.030)
24	BAD	RAF1	−0.016 (−0.031,−0.006)	0.000 (−0.009, 0.008)	0.016 (0.002, 0.032)
25	IGF1R	CDK6	0.014 (−0.001, 0.026)	−0.002 (−0.008, 0.003)	−0.016 (−0.029,−0.001)
26	RB1	CCND1	0.011 (0.000, 0.020)	−0.004 (−0.010, 0.004)	−0.014 (−0.025,−0.002)
27	AKT2	FGFR1	−0.003 (−0.015, 0.006)	0.011 (0.004, 0.017)	0.014 (0.002, 0.027)
28	BAD	TP53	−0.001 (−0.010, 0.007)	0.013 (0.002, 0.021)	0.014 (0.001, 0.026)
29	NRAS	BAK1	0.001 (−0.008, 0.008)	0.014 (0.006, 0.022)	0.014 (0.002, 0.025)
30	AKT2	BAK1	−0.004 (−0.013, 0.005)	0.010 (0.000, 0.019)	0.014 (0.001, 0.026)

Top table of dynamic correlations differences. is the difference between τ1 estimates in phase 3 (P3) and phase 1 (P1) The first two columns in Table 3 are the names of two genes. is the estimated τ1 in phase 1 cells, and is the estimated τ1 in phase 3 cells. is defined as . It quantifies the change of dynamic coexpression in relation to BRAF between phase 3 and phase 1 cells. From Table 3, we observed that genes PDGFC and FGFR1 have the largest between phase 1 and phase 3 cells. In phase 1 cells, the estimate of τ1 for PDGFC and FGFR1 is 0.045 and the 95% credible interval does not contain 0. In phase 3 cells, the estimate of τ1 is close to 0. This suggests that the regulatory mechanism between BRAF and the gene pair (PDGFC, FGFR1) changes between phase 1 and phase 3 cells. Czyz (2019) pointed out that melanoma cells somehow acquire the ability to grow independent of the two growth factors: FGFR1, PDGFC that helps melanoma cells to gain resistance toward BRAF treatment. Our finding from Table 3 is consistent with this finding. Interestingly, many top gene pairs listed in Table 3 are from the mitogen‐activated protein kinase (MAPK) and phosphoinositide 3‐kinase (PI3K) signaling pathways. Our analysis findings support the hypotheses described in Villanueva et al. (2011). In the above analysis, the convergence of MCMC was assessed using the Gelman–Rubin convergence statistic (Gelman et al., 1992). The convergence statistics were close to 1 for all τ1 estimates in all 351 gene pairs. The trace plots of the top five gene pairs are shown in Figure S.1. In our real data application, it took 67 minutes to implement ZENCO with three chains (100,000 iterations each) for all 351 gene combinations using 13 computing cluster nodes (each with 28 2.4 GHz Intel Xeon E5‐2680 v4 processors).

DISCUSSION

In this paper, we presented a zero‐inflated negative binomial dynamic correlation model for studying covariate‐dependent correlations in zero‐inflated, overdispersed count data, such as scRNA‐seq data. In our model, the correlation of two genes is regulated by the expression level of the third gene; a phenomenon we named dynamic correlation in this paper. This novel dynamic correlation focuses on studying the changes of conditional correlation. It is a different measure from the partial correlation coefficient. The partial correlation quantifies the amount of residual correlation between X 1 and X 2 after regression on X 3 to adjust for the influence of X 3 (Li, 2002). The proposed model in this paper takes both overdispersion and zero inflation of the data into consideration. With the proper choice of the values of parameters τ0 and τ1, the relationship between conditional correlation and the expression level of the third gene can be positive or negative. As demonstrated by our simulation studies, the ZENCO model significantly outperforms other existing approaches. Two other prior distributions for the dispersion parameters , and ϕ3 have been implemented: an informative Gamma distribution on and a half‐t‐distribution on . Our sensitivity analysis suggests that the , and ϕ3 estimates are robust regardless of prior distribution assumptions. The Gamma distribution with mean 100 and relatively large variance 10,000 used in this paper is more general and has slightly better performance in MCMC parameter estimates. Moreover, in our model, ρ is the correlation of the latent variable Z. The Fisher transformation of ρ is assumed to be linear with X 3. In a more general setting, the relationship between log and X 3 does not have to be linear. And our model can be easily adapted to other settings. In the melanoma data analysis, X 3 was used to denote the expression level of BRAF. And ZENCO model was implemented for each pairwise combination of X 1 and X 2 in the KEGG melanoma pathway. Using this search strategy, we found the pairs of genes whose BRAF‐associated dynamic correlations change significantly between different phases during treatment. In Table 3, we reported the top genes with the largest . Several existing type I error control approaches can be used in conjunction with the Bayesian model framework in ZENCO such as Käll et al. (2008) and Dawson and Kendziorski (2012). As described in Section 2, there are several ways to implement ZENCO in a genomic study. If a prefiltering step is used before implementing ZENCO, considerations described in van Iterson et al. (2010); Dawson and Kendziorski (2012) could be helpful to maintain type I error control. Furthermore, in our application, X 3 was used to denote the gene expression level of the BRAF gene because of its pivotal role in melanoma treatment and relapse in the study. In practice, the X 3 can be easily modified to represent the activity level of a biological process or different cell types, or various cellular conditions such as tumor status, survival probability, degree of inflammation, metastasis potential, and so on. Also, X 3 can be easily extended to represent a linear combination of several covariates or biological processes to accommodate the complexity of biological systems in other applications. Because several existing procedures are available for preprocessing scRNA‐seq data to remove low‐magnitude background noise, in the ZENCO model, the dropout component is modeled as a degenerate distribution with a point mass at zero. However, the method can be easily adapted to allow a low‐magnitude Poisson distribution to model the background noise in the dropout component. In this paper, our focus is on the changes in coexpression patterns between a gene pair. It is plausible that there might exist higher order interactions between genes (more than two genes), and a generalization of our approach to higher dimensions is feasible. However, special treatments need to be considered to guarantee the positive definiteness of the variance–covariance matrix in higher dimension. Tables and Figures referenced in Sections 3 and 4 are available with this paper at the Biometrics website on Wiley Online Library. R code and example data are available at the Biometrics website on Wiley Online Library. R code for implementing ZENCO is also available at http://www.github.com/zheny714/ZENCO. Click here for additional data file.

42 in total

1. Modeling dynamic correlation in zero-inflated bivariate count data with applications to single-cell RNA sequencing data.

Authors: Zhen Yang; Yen-Yi Ho
Journal: Biometrics Date: 2021-03-30 Impact factor: 1.701

1 in total

Modeling dynamic correlation in zero-inflated bivariate count data with applications to single-cell RNA sequencing data.

INTRODUCTION

METHOD

The ZENCO model

Search strategies

SIMULATION

Scenario 1: Simulating data from ZENCO

Scenario 2: Comparisons to existing approaches

Scenario 3: Estimation efficiency

Scenario 4: Robustness

Scenario 5: A multiple‐gene setting

EXPERIMENTAL DATA ANALYSIS

DISCUSSION

1. A functional genomic study on NCI's anticancer drug screen.

2. Modeling liquid association.

3. Regression analysis of overdispersed correlated count data with subject specific covariates.

4. Meta-analytic framework for modeling genetic coexpression dynamics.

5. Flexible bivariate correlated count data regression.

6. Filtering, FDR and power.

7. An Arabidopsis gene network based on the graphical Gaussian model.

8. Bayesian approach to single-cell differential expression analysis.

9. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.

Review 10. Single-cell RNA sequencing technologies and bioinformatics pipelines.

1. Modeling dynamic correlation in zero-inflated bivariate count data with applications to single-cell RNA sequencing data.