Qunlun Shen1,2, Shihua Zhang1,2,3,4. 1. NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. 2. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China. 3. Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China. 4. Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou, China.
Abstract
With the rapid accumulation of biological omics datasets, decoding the underlying relationships of cross-dataset genes becomes an important issue. Previous studies have attempted to identify differentially expressed genes across datasets. However, it is hard for them to detect interrelated ones. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. These four applications including 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool to uncover interrelated genes with strong biological implications and is scalable to large-scale datasets. Moreover, the number of such genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
With the rapid accumulation of biological omics datasets, decoding the underlying relationships of cross-dataset genes becomes an important issue. Previous studies have attempted to identify differentially expressed genes across datasets. However, it is hard for them to detect interrelated ones. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. These four applications including 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool to uncover interrelated genes with strong biological implications and is scalable to large-scale datasets. Moreover, the number of such genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
This is a PLOS Computational Biology Methods paper.
Introduction
High-throughput sequencing technologies (e.g., RNA-seq, scRNA-seq, scATAC-seq) provide an unprecedented opportunity to analyze biological process with large-scale data. For example, The Cancer Genome Atlas (TCGA) profiles numerous cancers with large amounts of omics data [1]; The Human Cell Atlas (HCA) profiles transcriptomics of thousands to millions of cells at single cell level. Recently, scATAC-seq has been greatly improved in terms of cell throughput and sequencing efficiency to view chromatin accessibility [2]. Integrative and comparative studies of such data is becoming a key tool to decipher the underlying relationships among genes [3-6].Differential analysis plays a vital role in comparative studies, and many methods like limma [7] and edgeR [8] have been put forward to identify differentially expressed genes between two different datasets [9]. However, such methods fail to identify genes with similar expression patterns between two datasets with different samples. Meanwhile, the problem of measuring the correlation between two genes in a single dataset or two multi-modal datasets from the same samples has been well studied and can be conducted using Pearson correlation coefficient, Spearman correlation coefficient, Kendall correlation coefficient and so on. However, these methods only capture linear dependence. Complex nonlinear dynamics exist widely in biological systems [10], which cannot be explained by simple linear relationship. More recently, the maximal information coefficient (MIC) [11] has been proposed to discover linear and non-linear dependency among the variable pairs in exploratory data mining, and detected a wide range of interesting associations between pairs of variables. For example, MIC has been applied to yeast gene expression profiles to identify genes whose transcript levels oscillate during the cell cycle [11]. It should be noted that the performance of MIC can be significantly reduced with a limited number of samples in practice [12].However, these methods are incapable of quantifying the associations of genes across two different datasets from different samples, preventing us from finding genes with patially similar expression patterns under different conditions. Specifically, for a target gene, we would like to measure the correlation of this gene across datasets, i.e., the correlation between two gene vectors with different samples (dimensions). This is a tough task since we don’t have matched samples. Thus, the correlation is not supposed to rely on the order of the samples (order-free), which means this correlation is determined by the distribution of genes. The distribution here indicates that the expression of a gene across a given number of cells are considered as the expression observations of this gene in a given dataset. A few advances have been made to measure the correlation of two random vectors with diverse dimensions. For example, distance correlation (DC) [13] was designed based on the principle that the two random vectors are independent if and only if their joint characteristic functions are the same as the product of their marginal characteristic function. An unbiased sample estimation was constructed to calculate the distance correlation coefficient. Under the null hypothesis, there exists a statistic obeying t-distribution based on sample estimation of DC. So we can easily calculate the p value of this hypothesis testing. This method views gene pairs with different dimensions as two gene vectors, and requires multiple observations to measure the associations between them, which makes it possible to compare gene vectors with different dimensions. However, we usually only have one observation or very few replicates for a target gene pair in real biological datasets. It is impossible to calculate DC directly under such circumstances. Therefore, quantifying the strength of association between genes across datasets remains an outstanding challenge.To this end, we propose Approximate Distance Correlation (ADC) to robustly select genes with high interrelation across two datasets (Fig 1). ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg (BH) adjustment to control the false discovery rate (FDR). Extensive experiments with simulation data and four biological applications demonstrate that ADC can uncover the most interrelated genes, which illustrates strong biological relevance. Moreover, the number of similar genes selected could serve as an index to measure the degree of similarity across different cell types and technologies. Lastly, ADC can be applied to datasets ranging from thousands to millions of cells.
Fig 1
Schematic diagram of ADC.
Single-cell gene expression data are used as an example to illustrate ADC. X and Y are data matrices with matched genes as the inputs of ADC. For each target gene, ADC selects k genes having the highest Pearson correlation coefficient with the target one to calculate the p value of DC. After that, ADC performs the BH adjustment to control the FDR and outputs the most highly interrelated genes.
Schematic diagram of ADC.
Single-cell gene expression data are used as an example to illustrate ADC. X and Y are data matrices with matched genes as the inputs of ADC. For each target gene, ADC selects k genes having the highest Pearson correlation coefficient with the target one to calculate the p value of DC. After that, ADC performs the BH adjustment to control the FDR and outputs the most highly interrelated genes.
Materials and methods
Data and preprocessing
We applied ADC to four biological scenarios (Tables A-D in S1 Supplementary Materials). (i) We downloaded the gene expression data of 21 different cancers with more than 200 tumor samples for each on 20 March 2020 from http://gdac.broadinstitute.org. For each cancer type, housekeeping genes [14] and genes with no expression in more than half of the tumor samples were removed. After filtering, we log-transformed the expression with a pseudo-count 1 and perform z-score normalization for each gene.(ii) We downloaded the scRNA-seq data of mouse hematopoietic cells from NCBI Gene Expression Omnibus with accession code GSE81682. Cells with less than 200 expressed genes and genes expressed in less than 3 cells were filtered. Also, cells with more than 25% mitochondrial genes present were filtered. We normalized the cells by size factor using the computeSumFactors function in the scran package [15] and log-transformed the gene expression data with a pseudo-count 1. For each cell type, the top 1000 highly variable genes were selected with scanpy package [16] as the input to ADC. There are six cell types remained including hematopoietic stem cells (HSC), common myeloid progenitors (CMP), granulocyte-onocyte progenitors (GMP), megakaryocyte-erythroid progenitors (MEP), common lymphoid progenitor (MPP), lymphoid-primed multipotent progenitor (LMPP) with 323, 328, 123, 362, 368, 280 cells, respectively.(iii) We obtained five scRNA-seq datasets of pancreatic islet cells profiled by five different technologies including indrop, CEL-Seq2, 10X, CEL-Seq, Smart-seq2 respectively [17-21]. Housekeeping genes were removed [14]. We normalized the gene counts per cell with the scanpy package and log-transformed the expression with a pseudo-count 1 and the z-score normalization were also performed for each gene. Finally, these data consists of 8569, 4776, 2477, 1538, 3354 cells respectively.(iv) We obtained the scRNA-seq data of human PBMC [22] and the scATAC-seq data [23]. The latter was transformed to gene activity matrix by the CreateGeneActivityMatrix function in the R package Seurat and the annotation file Homo_sapiens.GRCh38.91. The preprocessing procedure was the same as (iii). Finally, these data consists of 8728, 2638 cells respectively.
The ADC algorithm
Here, we have two biological data matrices X ∈ R and Y ∈ R with p matched genes, and each row represents a gene while each column represents a sample (cell). ADC is designed to measure the interrelation for each gene between two datasets based on distance correlation (DC) and selects the most similarly interrelated ones (Fig 1 and Algorithms in S1 Supplementary Materials).
DC
DC is calculated based on distance covariance. Distance covariance is a method to measure the distance between the product of marginal characteristic functions of two random vectors X ∈ R and Y ∈ R and their joint characteristic function. It is defined as:
where
here Γ(⋅) is the gamma function and the weight function w(t, s) ensures distance covariance is less than infinity. This definition is similar to that of the classical covariance and has a significant property, i.e., X and Y are independent if and only if . Just as the standard definition of correlation coefficient, the DC is defined as:However, we don’t know the exact distributions of X and Y. If we have k observations from them, we could use sample estimation to approximate DC. Let and denote the empirical characteristic functions of X, Y and (X, Y) (S1 Supplementary Materials). The sample estimation of distance covariance for random vectors X, Y is defined as:
and the sample estimation of DC is
It can be proved that under the independence hypothesis, as m, n tend to infinity,
converges to t-distribution with degrees of freedom. Thus, we can calculate the p-value easily for each hypothesis testing. We can see that the DC doesn’t depend on the order of samples of both datasets from the derivation process (S1 Supplementary Materials).
Select approximate observations for each gene
Naturally, multiple observations for each gene across both datasets are needed to estimate the empirical characteristic functions (S1 Supplementary Materials). However, we usually only have one observation for each gene across pairwise biological omics data. Profiling the given cells for multiple times to achieve this is unrealistic, since the cells are destroyed during the measurement process. Thus, it is impossible to perform the estimation directly. As we know, a set of genes usually tend to be highly correlated under the same condition due to the the modular organization of biological systems. In view of this, we introduce an approximate strategy to select k highly correlated ones with it as alternative observations in a single dataset to overcome this issue. We applied this procedure to all the genes individually. The value of k is set to 30 by default.
The BH adjustment for controlling FDR
Since we repeat to calculate ADC for each gene, we expect to control the FDR, which is defined as the expectation of false discovery proportion (FDP):
where X ∈ R and Y ∈ R, i ∈ {1, …, p}, is a subset of {1, …, p} denoting highly interrelatedd genes selected by ADC, a ∨ b = max{a, b} and # denotes the size of a set. In the language of hypothesis testing, we are interested in the m hypotheses {H: X and Y are independent} and want to control FDR. For instance, if we control the FDR at 10% and select 100 pairs of genes, then at most ten of them are false.The BH adjustment strategy is a classical method to control FDR for multiple hypothesis testing [24]. Consider multiple testing H1, H2, …, H with corresponding p-value P1, P2, …, P, we order the p-value so that P(1) ≤ P(2) … ≤P(. The adjusted p-value Q( for P( is calculated as follows:
For a given FDR level α, we call the ith testing is significant and only if Q( < α.
Computational complexity of ADC
To rapidly select the k approximate observations, ADC first constructs a Pearson correlation matrix for each dataset, and its computational complexity is O(p2 max{n, m}]. After that, ADC searches approximate observations for each gene based on the quicksort algorithm, and its total complexity for all genes is O(p2 logp). Then, ADC calculates the DC for each gene pair across the two datasets, and its complexity is O(k3(k + max{n, m}]) (S1 Supplementary Materials). Since k is a constant in ADC, the complexity of calculating the p-value of DC for all genes here is O(pmax{n, m}]. The BH adjustment is based on a sorting method, so its complexity is O(plogp). Taken together, the computational complexity of ADC is O(p2 max{n, m}] + O(p2 logp) + O(pmax{n, m}] + O(plogp) = O(p2(max{n, m} + logp)). Generally, the logp is far more less than max{n, m} for biological datasets, so the complexity can be reduced to O(p2(max{n, m}]. It is worth noting that in gene expression datasets, p, denoting the number of genes, is usually less than twenty thousand which will not increase so much, so the computational complexity of ADC is a linear complexity relating to the number of cells. Simulation experiments further confirmed this. Although the complexity is quadratic with p, the growth rate of quadratic functions is very slow in the simulation study (Fig B(a) in S1 Supplementary Materials). All these results suggest that ADC is very efficient in handling large-scale datasets. ADC was implemented in Python and the package can be downloaded from https://github.com/zhanglabtools/ADC.
Quantify the degree of similarity between two biological datasets
How to quantify the underlying similarity of datasets across different conditions, cell types, technologies and modalities is an important issue. For two single-cell omics data, we think that the expression patterns of genes in the same types of cells tends to be more similar. It is expected that ADC can detect more interrelated genes in the datasets with more common cell types. Thus, given a series of datasets, the number of genes selected by ADC could serve as a metric to measure the similarity between any two different datasets. The reciprocal of this number can be used to construct a distance matrix, and hierarchical clustering methods can be further applied it to build the hierarchical relation among these datasets. This could play key roles in many situations such as clustering cancer types with strong molecular similarity, measuring the consistency across technologies, exploring the degree of differentiation of progenitor cells, and mining the biological signals preserved across modalities.
Results
Simulation studies
In the simulation studies, we control the FDR with the BH adjustment method and verify the power of DC for selecting interrelated vectors. Here power means the probability of selecting genes exhibiting similar expression patterns. We considered four different simulation scenarios with 1000 vector pairs for each, half of them share similar structures. The dimensions of each pair of X and Y are m = 3000 and n = 10000, respectively. They are relatively large to the observation size 30, and the association of X and Y is built through the l shared dimensions. The nominal FDR level in all the experiments are 20%.Scenario 1. Draw X = (x1, …, x) independently from a standard normal distribution. Let y = sx (i = 1, …, l) and draw y (i = l + 1, …, n) independently from a standard normal distribution, where s ∼ U(1, 5).Scenario 2. Draw X = (x1, …, x) independently from a standard normal distribution, after that 90% of the entries are replaced with zeros randomly. Let y = sx (i = 1, …, l) and draw y (i = l + 1, …, n) independently from a standard normal distribution, where s ∼ U(1, 5), after that 90% of the entries in y (i = l + 1, …, n) are replaced with zeros randomly.Scenario 3. Draw X = (x1, …, x) independently from U(0, 1). Let y = log(x) (i = 1, …, l) and draw y (i = l + 1, …, n) independently from a standard normal distribution.Scenario 4. Draw X = (x1, …, x) independently from U(0, 1), after that 90% of the entries are replaced with zeros randomly. Let y = log(x) (i = 1, …, l) and draw y (i = l + 1, …, n) independently from a standard normal distribution, after that 90% of the entries in y (i = l + 1, …, n) are replaced with zeros randomly.Numerical results demonstrated that for all the four scenarios, FDR could be controlled well as expected with l > 25, and the power increased quickly with the l growing (Fig 2). Further, the FDR is around 10%, which is only a half of the nominal level. Thus, this method is very conservative and we could expect much lower FDR in real applications. Moreover, even the two vectors only have l = 100 related dimensions, which is very small compared with the lengths of the two vectors, the power will exceed 90% or 80% for the vectors with linear or nonlinear dependence structures, respectively (Fig 2A and 2C), indicating DC is very sensitive to capture tiny similarities. More importantly, sparseness does not hinder the power of DC even if zero-entries of the data is up to 90% (Fig 2C and 2D). These results hold true even the random vectors were generated from a much more complicated distribution (Fig A in S1 Supplementary Materials). Therefore, DC combined with the BH adjustment method is a powerful method to capture the dependent structure in data with high power and stable FDR control. When applying to biological data, we expect to select genes with similar expression (or regulatory) patterns. The length of the gene vector is the number of cells, we believe that even if genes are interrelated in very few cells across two datasets, DC can still accurately identify them.
Fig 2
Simulation experiments on DC combined with the BH method in terms of Power and FDR (the target level is 20%).
(A) Each pair of vectors are dense, and k dimensions are shared with a linear transform. (B) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a linear transform. (C) Each pair of vectors are dense and k dimensions are shared with a non-linear transform. (D) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a non-linear transform.
Simulation experiments on DC combined with the BH method in terms of Power and FDR (the target level is 20%).
(A) Each pair of vectors are dense, and k dimensions are shared with a linear transform. (B) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a linear transform. (C) Each pair of vectors are dense and k dimensions are shared with a non-linear transform. (D) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a non-linear transform.Next, we used the splatter R package [25] to simulate three scRNA-seq datasets with four cell types (Fig 3A). As expected, the results of ADC is consistent with the data structure (Fig 3B), and it selected most correlated genes between Data 1 and Data 3. Although there are only 40% cells overlapped between Data 1 and Data 2, ADC accurately captured the weak associations and revealed fewer correlated genes than those selected between Data 2 and Data 3 as expected. We also test the influence of hyper-parameter k by applying ADC to Data 1 and Data 2. Here, we selected the top 100 highly interrelated gene under each k varying from 20 to 40. For each pair of k, the number of overlapped genes was calculated. In average, we got almost 93 genes, that suggests ADC is a robust algorithm considering the selection of k (Fig B(c) in S1 Supplementary Materials).
Fig 3
Performance of ADC on three simulated datasets.
(A) The t-SNE plot of these datasets with three cell types in total. 80% of cells in Data 1 and Data 3 are of the same type, 60% of cells in Data 2 and Data 3 are of the same type, and 40% of cells in Data 1 and Data 2 are of the same type. Each dataset contains 1000 cells and 5000 genes. (B) Heat map of the highly interrelated genes selected by ADC across Data 1, Data 2 and Data 3 (FDR = 0.05). (C and D) Running time and peak cost of ADC with two datasets with 1 million cells and 10 thousand genes each in no more than 135 mins and under 225 GB of RAM. Each entry of the datasets was generated with a random variable which obeys uniform distribution. GB indicates the GigaByte.
Performance of ADC on three simulated datasets.
(A) The t-SNE plot of these datasets with three cell types in total. 80% of cells in Data 1 and Data 3 are of the same type, 60% of cells in Data 2 and Data 3 are of the same type, and 40% of cells in Data 1 and Data 2 are of the same type. Each dataset contains 1000 cells and 5000 genes. (B) Heat map of the highly interrelated genes selected by ADC across Data 1, Data 2 and Data 3 (FDR = 0.05). (C and D) Running time and peak cost of ADC with two datasets with 1 million cells and 10 thousand genes each in no more than 135 mins and under 225 GB of RAM. Each entry of the datasets was generated with a random variable which obeys uniform distribution. GB indicates the GigaByte.Moreover, ADC could handle two datasets with ten thousand cells each in about 15 minutes. The running time increased linearly in term of the number of cells (Fig 3C), which confirms our derivation on the computational complexity. The peak memory cost of ADC is less than twice of generating the data, indicating ADC consumes less memory than copying the data (Fig 3D). All results suggest that ADC is a powerful method and applicable to large-scale datasets.
Highly interrelated genes across-cell types reveal cancer similarities
We applied ADC to the gene expression data of 21 different cancers and selected highly interrelated genes across each pair of cancers (Fig 4A). The number of highly interrelated genes between each pair of cancers clearly reflects the degree of similarities between different cancers (Fig 4A). Hierarchical clustering based on it demonstrates that cancers generated in similar tissues are clustered together, such as GBMLGG and LGG, KIPC and KIRP, COAD and COADREAD, SRAD and STES. Further, the two cancer types BRCA and OV occurring most frequently in women demonstrate great similarity. Besides, we also found a cluster of two squamous carcinoma, HNSC and CESC, which have been shown to have strong molecular similarity [26]. All these observations indicate that ADC could gain deep insights into the common characteristic of cancers.
Fig 4
Highly interrelated genes and their biological functions among 21 cancers.
(A) Heatmap of the number of highly interrelated genes selected by ADC between each pair of cancers (FDR = 0.05). Hierarchical clustering was performed with the reciprocal value of the number. (B and D) The top ten enriched functional terms of these selected genes between HNSC and CESC (B), BRCA and OV (D) respectively. (C and E) The gene network constructed with GeneMANIA using the selected genes between HNSC and CESC (C), BRCA and OV (E) respectively.
Highly interrelated genes and their biological functions among 21 cancers.
(A) Heatmap of the number of highly interrelated genes selected by ADC between each pair of cancers (FDR = 0.05). Hierarchical clustering was performed with the reciprocal value of the number. (B and D) The top ten enriched functional terms of these selected genes between HNSC and CESC (B), BRCA and OV (D) respectively. (C and E) The gene network constructed with GeneMANIA using the selected genes between HNSC and CESC (C), BRCA and OV (E) respectively.Clearly, the enriched biological functions of the genes selected from two pairs of cancers implicate to key cancer hallmarks [27] such as cell-cell adhesion, T cell proliferation, immune system process and response to interferon-γ, suggesting that these selected genes are indeed biologically relevant (Fig 4B and 4D). We also constructed gene functional networks by GeneMANIA [28] with the physical and genetic interactions of these genes (Fig 4C and 4E). The hub gene APPL1 of the first network is related with cell cycle [29]. Moreover, it is an immune cell enhanced gene and the protein encoded by this gene has been shown to be involved in the regulation of cell proliferation [30]. In the second functional network, the hub gene EMSY is an oncogene linking the BRCA2 pathway to sporadic breast and ovarian cancer [31] and is involved in several cancer related biological processes like DNA damage, DNA repair, transcription and transcription regulation. The enriched functional terms of other cancer pairs also reflects cancer hallmarks (Fig C in S1 Supplementary Materials). These results shows that ADC indeed could find many strongly associated genes of two cancers.
Highly interrelated genes across cell types reveal their hierarchical lineage
Here we applied ADC to six distinct hematopoietic stem and progenitors from mouse bone marrow at single cell level (Fig 5A). The numbers of highly interrelated genes of any two cell types clearly reveal their lineage structure (Fig 5B). As expected, HSC giving rise to all hematopoietic lineages, shows distinct differences with other cell types at the end of differentiation according to the selected genes based on ADC (Fig 5B). For HSC, ADC selected the most correlated genes with LMPP than any other cell types, we speculated that LMPP could be directly derived from HSC, which was confirmed by a recent study [32]. CMP is an early myeloid progenitor cell which differentiates to GMP and MEP, but it shows much more similarities with its progenitor MPP. Moreover, we found that CMP have more similarities with GMP compared with MEP based on the number of selected genes (182 vs 173). This was first confirmed by visualization of these cells (Fig D(a) in S1 Supplementary Materials), where CMP clearly has more cells overlapped with GMP. We further applied an unsupervised clustering method leiden [33] to identify 13 clusters (Fig D(b) in S1 Supplementary Materials), and CMP shows similar cell distribution with GMP compared to MEP (Fig 5C), and the cells overlapped between these two cell types are far more than that between CMP and MEP (Fig 5D). Further, the CMP and GMP belong to myeloid cells while MEP belong to the erythroid cells, and a study also shows that CMP and GMP preferentially differentiate into proangiogenic cells compared with MEP [34]. Thus, CMP and GMP should be more similar at the gene expression level, and the same result has also been observed from the human hematopoietic cell data (Fig 5B). All these results indicate that ADC can capture the true genetic similarities and infer cell lineage structure using the sparse scRNA-seq data.
Fig 5
Highly interrelated genes selected by ADC between different cell types along the hematopoietic cell lineage.
(A) A schematic of mouse hematopoietic differentiation. The cells in gray color cell type were not present in our datasets. (B) Heat map shows the number of highly interrelated genes selected by ADC cross six cell types (FDR = 0.10). The upper triangular is the result of mouse hematopoietic cells while the lower triangular is the result of human hematopoietic cell data downloaded from GEO with accession code GSE117498. Unsupervised hierarchical clustering analysis was performed with the reciprocal value of the number. (C) Confusion matrix of data-driven clusters representing the percent frequency distribution of immunophenotypically defined cell types. (D) Heat map shows the dissimilarity of the distributions of the three cell types using the Jensen-Shannon (JS) divergence.
Highly interrelated genes selected by ADC between different cell types along the hematopoietic cell lineage.
(A) A schematic of mouse hematopoietic differentiation. The cells in gray color cell type were not present in our datasets. (B) Heat map shows the number of highly interrelated genes selected by ADC cross six cell types (FDR = 0.10). The upper triangular is the result of mouse hematopoietic cells while the lower triangular is the result of human hematopoietic cell data downloaded from GEO with accession code GSE117498. Unsupervised hierarchical clustering analysis was performed with the reciprocal value of the number. (C) Confusion matrix of data-driven clusters representing the percent frequency distribution of immunophenotypically defined cell types. (D) Heat map shows the dissimilarity of the distributions of the three cell types using the Jensen-Shannon (JS) divergence.
Highly interrelated genes across technologies illustrate their inherent difference
Here we applied ADC to five scRNA-seq datasets of pancreatic islet cells profiled by indrop, CEL-Seq2, 10X, CEL-Seq, Smart-seq2, respectively. Obviously, common cell types like alpha, beta, ductal, acinar are shared by these five datasets. Conceptually, highly variables genes (HVGs) among data should strongly contribute to cell-to-cell variation within relatively homogeneous cell populations [35]. That is to say, HVGs could depict the cell characteristics well. However, we find that the top HVGs overlapped among these datasets cannot reflect the similarities between these technologies (Fig 6A). The two linear amplification methods CEL-seq and CEL-seq2 are separated in two different clusters irrationally [36, 37]. While the two droplet-based methods, 10X and indrop, which can profile a mass of cells and have a relatively low resolutio [38], also shows less similarity than that between 10X and CEL-seq2. Even if we increased the number of HVGs (Fig E(a-d) in S1 Supplementary Materials), the results still failed to reveal the relative differences of these technologies. Strikingly, ADC could identify a number of correlated genes across-technology data, and reveal the technology similarity across them (Fig 6B). Specifically, CEL-Seq showed the most similarity with its modified version CEL-Seq2, 10X and indrop were clustered together while other three methods which profiled genes with a higher resolution showed distinct similarities.
Fig 6
(A) Number of overlapping genes among the top 1000 HVGs of the data from five different technologies, and (B) Highly interrelated gene selected by ADC between each pair of these datasets (FDR = 0.01).
Unsupervised hierarchical clustering was performed with the reciprocal value of the number in both situations.
(A) Number of overlapping genes among the top 1000 HVGs of the data from five different technologies, and (B) Highly interrelated gene selected by ADC between each pair of these datasets (FDR = 0.01).
Unsupervised hierarchical clustering was performed with the reciprocal value of the number in both situations.
Similarly correlated genes across modalities reveal biological relevant genes
Next we applied ADC to scATAC-seq (using gene activity matrix [39]) and scRNA-seq datasets of human peripheral blood mononuclear cells (PBMC) with 8728 cells and 2638 cells respectively to select correlated genes across them (Fig 7A). ADC revealed 195 highly related genes, which are involved in PBMC related biological process like interferon-gamma production, leukocyte differentiation, antigen receptor-mediated signaling pathway (Fig 7B) [40, 41]. The hub gene of the network, GNAI2, which regulates the entrance of B Lymphocytes into lymph nodes and B cell motility within lymph node follicles [42]. Besides, another hub genes FMNL2, is also enriched in monocytes [30]. All these observations demonstrate that although scATAC-seq and scRNA-seq data are different types, we can still explore molecular similarities across these different modalities.
Fig 7
Biological functions and network analysis of highly interrelated genes selected by ADC across modalities.
(A) Schematic diagram shows the work flow of performing ADC on the scRNA-seq data and scATAC-seq data (FDR = 0.20). We first converted the scATAC-seq data to a predicted gene expression matrix. Specifically, we constructed a “gene activity matrix” from scATAC-seq dataset by utilizing the reads at gene body and 2kb upstream, then ADC was applied to a pseudo gene expression data and a real one. (B) The top enriched functional terms of the genes selected by ADC. (C) The gene network constructed with GeneMANIA using the top genes selected by ADC.
Biological functions and network analysis of highly interrelated genes selected by ADC across modalities.
(A) Schematic diagram shows the work flow of performing ADC on the scRNA-seq data and scATAC-seq data (FDR = 0.20). We first converted the scATAC-seq data to a predicted gene expression matrix. Specifically, we constructed a “gene activity matrix” from scATAC-seq dataset by utilizing the reads at gene body and 2kb upstream, then ADC was applied to a pseudo gene expression data and a real one. (B) The top enriched functional terms of the genes selected by ADC. (C) The gene network constructed with GeneMANIA using the top genes selected by ADC.
Discussion
With the rapid development of high-throughput sequencing technologies (e.g. RNA-seq, scRNA-seq, scATAC-seq), a huge number of biological datasets under different conditions have been profiled. Here we proposed ADC to analyze the pairwise data generated from different tissues, conditions or technologies with high efficiency. To our knowledge, this is the first method to identify highly interrelated genes across different biological datasets.ADC derived from DC measures the correlation of random variables in arbitrary dimensions. As we shown in the simulation studies, combined with the BH adjustment method, DC could accurately detect associated variables with high power and low FDR, even there are only few related dimensions. Moreover, the number of selected genes can reflect the degree of similarity between datasets. ADC is applicable to massive datasets with millions of samples in several hours. Extensive tests on four real applications demonstrate its effectiveness. Also, ADC can find tissue-specific genes which are interrelated across human and mouse, where these genes capture conserved functions across them (Fig F in S1 Supplementary Materials). Thus, ADC is expected to be applied to diverse biological datasets under different conditions.The number k of selected highly correlated genes in each dataset is an important parameter which could affect the gene sets for DC calculation. We have empirically demonstrated that the default one work well in the simulation studies and biological applications. Alternatively, we could provide a more robust version by leveraging the interrelated genes selected by ADC with different ks. Specifically, we detected the interrelated gene under each k varying from 20 to 40, and we kept genes which were considered as interrelated ones under 21 different ks. The results are very consistent with the original ones (Fig 6B and Fig E(e) in S1 Supplementary Materials).Recently, with the improvement in high-throughput single cell technologies, different experiments such as chromatin accessibility [43], DNA methylation [44], RNA modification [45] and gene expression datasets at single cell level are increasing rapidly (Fig G in S1 Supplementary Materials). Many methods have been put forward to process these data [46-48]. Most importantly, integration analysis to transfer knowledge from one dataset to another has became increasingly important [49]. For example, Haghverdi et al. [50] used mutual nearest neighborhoods (MNN) on the orginal data space to find the same cell types. Stuart et al. [23] improved this methods by performing MNN on the reduced dimension learned by canonical correlation analysis. However, these methods only focus on the integration problem itself but ignore whether these datasets can be integrated or not, which may lead to “overcorrection” [51]. Moreover, they may fail to consider the order of the integration. While ADC can provide the order of integration based on the number of selected genes, which can help to improve the integration result. Also, from the perspective of knowledge transfer, given a newly generated data with limited prior knowledge, we would like to select a most correlated dataset which has been fully explored as a reference. ADC is a suitable tool under this circumstance since it can quantitatively measure the degree of similarity between two datasets. Furthermore, ADC can also uncover interrelated variables at molecular level, which may increase our insights into a new dataset. Because ADC has no assumption, it can be directly applied to other biological datasets without any modification. ADC is a valid data-driven method which doesn’t employ any external information now. It will be an interresting direction to consider external knowledge (e.g., gene ontology) to improve it in future.
Supplementary figures, tables and analysis details.
Further detailed descriptions of the principle of ADC, the computational complexity of ADC, and more information about the methods and process of data analysis for single-cell RNA-seq data. Fig A. Simulation experiments on DC combined with the BH method in terms of Power and FDR (target is 20%). We generated each pair of variables with 3000 and 10,000 dimensions, respectively. Every non-zero entry of the variables was sampled from a beta(2,4) distribution. (a) Each pair of vectors are dense and k dimensions are shared with a linear transform. (b) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a linear transform. (c) Each pair of vectors are dense and k dimensions are shared with a log transform. (d) Each pair of vectors are sparse with 90% zero entries and k dimensions are shared with a log transform. Fig B. Performance of ADC on simulated datasets. (a and b) Running time and peak menmory cost of ADC with two datasets with 10 thounsand cells and different number of genes. GB indicates the GigaByte. (c) We applied ADC to data1 and data2 generated by splatter (Fig 3A) under each k from 20 to 40. For each k, we selected top 100 interrelated genes and calculated the number of overlapped top genes for each pair of k. The result is showed in the boxplot. Fig C. Functional enrichments of selected genes between five pairs of cancers. (a) BLCA and LUSC, (b) GBMLGG and LGG, (c) KIRC and KIRP, (d) STAD and STES, and (e) COAD and COADRED. Fig D. PCA plots of hematopoietic stem cells CMP, GMP and MEP. (a) The cells are colored by the cell types annotated by the combination of molecular surface markers. (b) The cells are colored by the cell types annotated by an unsupervised clustering method Leiden. Fig E. Numbers of selected genes across five technologies for the data with different number highly variable genes. (a) top 2000 genes, (b) top 3000 genes, (c) top 4000 genes, and (d) top 5000 genes. Unsupervised hierarchical clustering analysis is performed with the reciprocal value of the numbers. Fig F. Enriched GO terms of highly interrelated genes between human and mouse done by Metascape. Fig G. The number of scRNA-seq datasets per month from January 2013 to July 2020. The statistics was obtained from NCBI Gene Expression Omnibus with key words: “single cell RNA seq” or “single cell transcriptome” or “single cell gene expression”. Table A. Number of samples of 21 types of cancer. 17 out of 38 types of cancer in TCGA were excluded in our study due to limited numbers of samples. Table B. Number of cells of six hematopoietic cell types. Table C. Number of pancreatic islet cells from five different technologies. Table D. Numbers of PBMC cells from two different sequencing methods.(PDF)Click here for additional data file.9 Jun 2021Dear Dr. Zhang,Thank you very much for submitting your manuscript "Approximate distance correlation for selecting similarly distributed genes across datasets" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that thoroughly addresses the reviewers' comments. In particular, as pointed out by multiple reviewers, you should significantly improve the rigor of the method evaluation and data interpretation and further demonstrate the advantages over existing methods (e.g., for data integration application). In addition, please make the source code associated with the proposed method publicly available.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Mingyao LiAssociate EditorPLOS Computational BiologyJian MaDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: In this study, the authors developed a method to study the gene expression conservation of each gene between different datasets based on distance correlation using co-expressed genes. They showed that their method can identify conserved genes from bulk RNA-seq, scRNA-seq, and scATAC-seq datasets. Although the results are compelling, the paper can be further improved. See detailed comments below.1. The authors select k=30 highly correlated genes in each dataset to calculate the distance correlation. How the selection of k affects the result is not clear. Why selecting k=30 as the default value has to be justified.2. How batch effects affect the result is not examined. The authors showed in Figure 6 that their method can capture more similar genes between sequencing platforms that are similar. It shows batch effects (here platform effects) are clearly affecting the results.3. The proposed method aims to reveal genes that have similar expression patterns between two datasets. Can it be generalized to more than two datasets? For example, can it be applied to reveal common genes across all cancer samples showed in Figure 4?4. Although the method is designed to reveal genes that are similar between datasets, can it reveal genes that are most diverse (not similar) between datasets?5. For the hematopoietic single-cell RNA-seq analysis in Figure 5B, the author only showed results for CMP, GMP, and MEP for human bone marrow. How about the results from other cell types? Is it also consistent with the mouse data?6. Another interesting question the authors haven’t explore is cross-species conservation. Can the proposed method reveal genes that have similar expression patterns between human and mouse?7. For Figure 7, the authors studied the similar genes between PBMC scRNA-seq and scATAC-seq datasets. Recently, multi-modal scRNA-seq+scATAC-seq datasets are also available for PBMC which can be obtained from the 10X genomic website. It will be interesting to see if the genes identified by the proposed method between scRNA-seq and scATAC-seq are similar to those identified using simple correlation metrics based on the multi-modal datasets.8. The application of the proposed method to data integration is not convincing. The state-of-the-art methods for single-cell data integration such as Harmony are not affected by the order of sample in integration. Also, the example is too simple. If the authors want to show the usage of their method in integration, they may want to provide a more comprehensive analysis.Reviewer #2: In this paper, the authors propose to use the Approximate Distance Correlation (ADC) to identify similarly distributed genes between two biological datasets, and subsequently to use the identified genes to measure the similarity between these two datasets. For each gene, they first identify its k most correlated genes within each dataset, and then they use the measurements of these two sets of k genes in their corresponding datasets to calculate the distance correlation (DC), which they define as the ADC.Although the paper is overall clearly written, we have several major concerns about the validity of the methodology. In our opinion, these concerns, if not addressed, will preclude ADC to be a useful data analysis tool.1. Our first concern is the motivation of ADC. If the goal is to identify genes with similar distributions in two datasets, then what is the definition of “distributions.” To us, this is the key and should be clarified in the beginning.2. We have a major concern about applying the distance correlation to the manually selected k genes in each dataset. In Section “Materials and methods”---"Select approximate observation for each gene," distance correlation requires (X_i, Y_i) to be a true pair, and {(X_i, Y_i)} to be iid. Here, however, (X_i, Y_i) is not from the same gene and thus not naturally paired, and {(X_i, Y_i)} are neither independent (they are manually selected, with high correlation) nor identically distributed (they are from different genes). Hence, the basic assumptions of the distance correlation are violated.3. Due to the fact that there exists only one observation for each gene, we suggest that the authors view the biological problem on a metagene level as a remedy. To be more specific, authors can first cluster genes into metagenes in the original datasets, and then they can calculate the distance correlation for each metagene. In this way, replicates (X_i, Y_i) are naturally paired for the i-th gene in metagene.4. If the authors do want to achieve the single-gene resolution, to remedy the violation of the assumptions required by the distance correlation, the authors may consider taking the advantage of gene orthology information instead of using the same data twice to select the top k correlated genes for every gene. At least the authors must make sure that (X_i, Y_i) is paired by external information, not the data to be used to calculate the distance correlation.5. There's little discussion on the choice of parameter k (number of selected highly correlated genes in each dataset). It affects the sets of genes for distance correlation calculation, as well as the asymptotic distribution. Instead of simply proposing a default value of k, the authors should discuss the effect of k and provide guidance on choosing k.6. In Fig 6A, we see that results are far from satisfactory in the overlap of top HVGs. We wonder whether this is a fair comparison. Are the selected top HVGs restricted to the common genes shared between datasets?7. The authors should apply more direct validation other than comparing the number of similarly expressed genes and the data structure. For example, to highlight that the algorithm can detect similarly distributed genes across datasets, the authors can directly compare the overall expression pattern of the same genes with high ADC scores in both datasets.Reviewer #3: In this manuscript, the authors develop a statistical method, namely Approximate Distance Correlation (ADC) for detecting genes with similar expression patterns across multiple samples or datasets. ADC is an alternative to DC, where multiple samples are needed for calculation. Instead of multiple samples, ADC utilizes the genes that are highly correlated with a gene, as alternative observations. The authors formulate their solutions and show its effectiveness with multiple case studies.Although the idea seems interesting and the case study results are impressive, there are some problems with the manuscript, as pointed out below:Major Comments:1. The most important issue about this method is using the correlated genes as alternative observations for a gene. Authors tell that this is necessary because often the experiment cannot be repeated for the same sample, noting that “Profiling the given cells for multiple times to achieve this is unrealistic, since cells are destroyed during the measurement process.” However, in the same manner, using correlated genes as alternative observations is similarly unrealistic, since almost every gene has its own distinct expression pattern. In this sense, I think it is highly questionable to use correlated genes for this purpose, instead of relying on multiple cells, which are usually in the same biological state. I think the approach followed here, using ADC instead of DC, is only diverting the problem of reproducibility on a different aspect, instead of solving it.2. In this context, the authors should compare their method to using Distance Correlation (DC) for each benchmarking and case study. Does their method (ADC) outperform DC, by using the same set of samples? Since ADC depends on correlation, which requires multiple samples or cells, DC can be computed and used for every test.3. Page 5, Line 117: “Profiling the given cells for multiple times to achieve this is unrealistic. Since cells aredestroyed during the measurement process” should be single sentence “Profiling the given cells for multiple times to achieve this is unrealistic, since since the cells are destroyed during the measurement process.”4. Page 7, Line 176: What do m and n denote? Genes and cells?5. Page 8, Line 216: “Although there are only 40% cells overlapped between Data 1 and Data 3” contradicts with the Fig. 3 legend: “40% of cells in Data 1 and Data 2 are of the same type.” Is it Data 2 or Data 3?6. Page 9, Figure 3 legend: FDR is chosen 0.20? Why so large cutoff? Generally FDR is chosen 0.01 or 0.05, very rarely 0.1 at most.7. Page 9: The authors do some GO annotation for similar genes and find cancer hallmark terms as a result. Using DC or just overlapping highly expressed genes can also give the same results. Is ADC superior in any sense? And just finding halmark terms doesn’t mean ADC could gain deep insights, since this is nothing new. It is just some kind of validation.8. Page 10: Figure 4 C and E. In the heatmap, hundreds of genes are reported but very small number of genes are shown in C and E. How are those genes selected? Why is particularly special about those genes? What are the node sizes proportional to? Why are some genes colored pink and others white?9. Page 11. Figure 5A. What are those cell types (HSC, MPP, etc.)? Since not every reader is a hematologist, these acronyms should be defined in the figure caption.10. Page 12, Line 287 and Fig. 6: “Thus, top HVGs of the five datasets are supposed to have many overlaps since they were all profiled about the similar cell types. However, we only found a few HVGs overlapped among these datasets (Fig. 6A).” The numbers in the heatmaps in Fig 6A and 6B are not comparable. If you start with 1000 genes for an analysis and 20,000 genes for the other, you will obviously get different numbers in total. The actual numbers are not comparable between two. Only clustering can be compared.11. Page 13, Figure 7C: Similar issue. Like Fig 4 C and E, how are those genes selected? Why is particularly special about those genes? What are the node sizes proportional to? Why are some genes colored pink and others white?12. Although the authors developed a method, they did not provide the computational implementation of their framework in any platform. In this context, this study has little benefit for the scientific community. The authors must compile a software package in a widely used open-source programming language (such as R) not a commercial software language (such as Matlab), so that any researcher interested in this method and can freely download and apply this method on their own data. The software must be deposited in a publicly accessible repository such as GitHub and properly documented with code comments, package reference and tutorials. Any researcher should be able to download and apply it in a couple hours without need to intensive effort.13. In addition to publishing the software package for the framework, the authors should compile a separate package for just re-producing all the results presented in this manuscript. The authors must provide all the code, scripts and documentation in addition to interim data to replicate the results presented. If the public datasets that are used is reasonable size, the authors should also deposit it in a repository for easy access. If some of the public data is very large (larger than > 100GB), the authors should provide links to download the data with information about how to process it.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: No: The computational code for the proposed method is not available.Reviewer #2: NoneReviewer #3: No: Although the authors developed a method, they did not provide the computational implementation of their framework in any platform. In this context, this study has little benefit for the scientific community.**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols9 Aug 2021Submitted filename: ReviewReport_ADC_v5.pdfClick here for additional data file.31 Aug 2021Dear Dr. Zhang,Thank you very much for submitting your manuscript "Approximate distance correlation for selecting similarly distributed genes across datasets" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that addresses Reviewer #2's comments.We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Mingyao LiAssociate EditorPLOS Computational BiologyJian MaDeputy EditorPLOS Computational Biology***********************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: The authors have addressed all the comments.Reviewer #2: The authors did not address the major concerns raised by Reviewers 2 and 3 regarding the validity of the proposed method ADC. Most importantly, the biological meanings of the genes selected by ADC are hard to interpret. Moreover, we do not agree with the authors that these genes should be referred to as "similarly distributed genes." We did a simulation study to show that a gene with the same distribution in two batches of cells cannot be captured by ADC. Hence, we are not convinced that ADC has a solid probability foundation or a biological relevance.Below please find our detailed comments.1. The biological meanings are unclear for the "similarly distributed genes" identified by ADC. The current ADC implementation tries to do the following. Take k = 4 as a toy example; for every tested gene, it selects (k-1) top correlated genes in both datasets, which may be {A, B, C} in dataset 1, and {D, E, F} in dataset 2; ADC will identify this tested gene as a "simiarly distributed gene" if and only if {tested gene, A, B, C} and {tested gene, D, E, F} are "correlated" in the sense of distance correlation (which roughly means that the 6 pairwise distances among {tested gene, A, B, C} and the 6 pairwise distances among {tested gene, D, E, F} are correlated). However, there is no guaranteed correspondence between {A, B, C} and {D, E, F}, and thus the distance correlation between {tested gene, A, B, C} and {tested gene, D, E, F} is biologically meaningless. Also, how does this correlation inform whether the tested gene is "similarly distributed" or not? Please see point 2 below.2. There is a gap between what ADC claims to identify and what ADC actually identifies. What ADC claims to identify are the genes that have similar distributions in two datasets, where they have m and n observations, respectively. ADC aims to perform the following hypothesis testing: for every tested gene, X \\in R^m, Y \\in R^n are 2 random vectors, representing that gene's count vectors in m and n cells, respectively; H_0: X and Y are independent vs. H_a: X and Y are not independent" (Page 5, Section "DC"). While ADC claims/aims to identify those genes that reject H_0, what ADC actually does is different (please refer to our comment 1 that ADC actually identifies those genes whose top correlated genes are "correlated" between the two datasets). In the claim, X and Y should both correspond to the tested gene; however, in the actual implmentation, X refers a not-well-defined random vector whose four observations are {tested gene, A, B, C}, while Y refers a not-well-defined random vector whose four observations are {tested gene, D, E, F}. Given that A, B, and C are selected based on their high correlations with the tested gene in the first data set, it is impossible to assume that {tested gene, A, B, C} are i.i.d. observations of X. Hence, the actual implmentation of ADC is invalid in the probability sense.We design the following simulation to demonstrate this gap:Generate X1, X2, ..., Xm, Y1, Y2, ..., Yn ~ N(0, I_p) i.i.d.Denote X = [X1, X2, ..., Xm], Y = [Y1, Y2, ..., Yn], where rows correspond to the p genes, and the m and n columns correspond to the m and n cells in two datasets.Run the python code provided by the authors.In the result, ADC does not identify any similar genes (FDR = 0.05).This result is expected because for every tested gene, the top k correlated genes in X and Y are independently simulated and thus not correlated, so no gene should be identified by the ADC method as implemented. ("the actual goal")However, by design, every gene is simulated from the same standard Gaussian distribution in the two datasets, so it perfectly satisfies the definition of "similarly distributed gene" and thus should be identified. ("the ideal goal")Therefore, we conclude that the actual goal of ADC is different from its ideal goal.3. To re-formulate ADC as a valid method, we deem it necessary for the authors to accurately and formally define "distribution" and "similar genes" using probability notations.4. A potential fix: for a tested gene, if its top k correlated genes in the two datasets are paired by external information (e.g., gene orthology), shared, or naturally paired, the ADC result may carry more biological meaning.Reviewer #3: My comments were addressed. The work is ready for publication.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: NoneReviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: NoFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols12 Sep 2021Submitted filename: ReviewReport.pdfClick here for additional data file.10 Oct 2021Dear Dr. Zhang,We are pleased to inform you that your manuscript 'Approximate distance correlation for selecting highly interrelated genes across datasets' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Mingyao LiAssociate EditorPLOS Computational BiologyJian MaDeputy EditorPLOS Computational Biology***********************************************************Reviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #2: To address our criticism on the statistical formulation, the authors attempted to rephrase their goal as to find genes that share some correlated structure between two datasets ("interrelated genes"). However, this new goal is still not statistcally sound: the correlation structure the approximate distance correlation (ADC) aims to capture cannot be written down in the probability language using random variables.To restate, this is what ADC does: given a tested gene, 1) selecting k top correlated genes in each dataset respectively; 2) pair the k top correlated genes by the order of Pearson correlation with the tested gene; 3) calculate Distance Correlation on the above k pairs.We maintain our position that calculating the distance correlation between two sets of top correlated genes, which are selected respectively in each dataset and ordered by correlation, is biologically meaningless. This concern, if not addressed, would make the entire ADC method invalid.1. A fundamental issue with the proposed ADC is that the input data of ADC are NOT the required input data of the distance correlation (DC). There are two unreconcilable differences:(1) DC requires the input data to be naturally paired, and the pairing cannot be inferred from the same input data. In contrast, the paired "observations" input into ADC are inferred from the same input data. In other words, ADC uses the same input data for twice: first for inferring the pairing and second for calculating the DC. This would result in invalid, uninterpretable, inflated DC values.(2) In ADC, the "observations" are NOT real observations but other genes. The authors' responses did not convince us that "correlated genes" of a given gene cannot regarded as observations of that one, an assumption that does not make either biological or statistical sense.2. If the authors want to modify their paper to become statistically sound, we deem it necessary that the authors pair genes by external information (e.g., gene orthology), so that genes are naturally paired.3. The new term "interrelated genes", which was modified from "similarly distributed genes", lacks both biological and statistical definitions.Reviewer #4: In this paper, the authors propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets.Overall, I felt that the ADC method is meaningful and valid. The reviewer 2 made a good point that the previously-proposed concept "similarly distributed genes" is confusing and misleading. This has since been replaced by “interrelated genes” by the authors in this round of revision. In my opinion, the revisions made by the authors are responsive and adequate.I think the ADC concept is valid and useful. But I felt that ADC is motivated more from the mathematical side, instead of the biological side. The results derived from ADC made sense, but I think other simpler measures may achieve similar results. Anyway, I think this is a very minor point. At this stage, the authors do not have to address it.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #2: NoneReviewer #4: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #2: NoReviewer #4: No1 Nov 2021PCOMPBIOL-D-21-00727R2Approximate distance correlation for selecting highly interrelated genes across datasetsDear Dr Zhang,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Zsofia FreundPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: Hannah A Pliner; Jonathan S Packer; José L McFaline-Figueroa; Darren A Cusanovich; Riza M Daza; Delasa Aghamirzaie; Sanjay Srivatsan; Xiaojie Qiu; Dana Jackson; Anna Minkina; Andrew C Adey; Frank J Steemers; Jay Shendure; Cole Trapnell Journal: Mol Cell Date: 2018-08-02 Impact factor: 17.970