Literature DB >> 35271564

A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data.

Snehalika Lall¹, Sumanta Ray^2,3, Sanghamitra Bandyopadhyay¹.

Abstract

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. There are various issues in single cell sequencing that effect homogeneous grouping (clustering) of cells, such as small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell-cycle, cellular morphology, and variable reagent concentrations. Moreover, single cell data is susceptible to technical noise, which affects the quality of genes (or features) selected/extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single clustering), a stepwise robust unsupervised feature extraction and clustering approach that formulates and aggregates cell-cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor that is learned by a graph-based artificial intelligence model, graph convolution network. The learned representation (low dimensional embedding) is utilized for cell clustering. sc-CGconv features the following advantages. a. sc-CGconv works with substantially smaller sample sizes to identify homogeneous clusters. b. sc-CGconv can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering. c. sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through copula correlation measure. d. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35271564 PMCID： PMC8979455 DOI： 10.1371/journal.pcbi.1009600

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

This is a PLOS Computational Biology Methods paper.

Introduction

Recent developments of single cell RNA-seq (scRNA-seq) technology made it possible to generate a huge volume of data allowing the researcher to measure and quantify RNA levels on large scales [1]. This has led to a greater understanding of the heterogeneity of cell population, disease states, cell types, developmental lineages, and many more. A fundamental goal of scRNA-seq data analysis is cell type detection [2, 3]. The most immediate and standard approach performs clustering to group the cells, which are later labeled with specific type [4, 5]. This provides an unsupervised method of grouping similar cells into clusters that facilitate the annotation cells with specific types present in the large population of scRNA-seq data [3, 6, 7]. The standard pipeline of downstream analysis of scRNA-seq data starts from the processing of the raw count matrix, and goes through the following steps [8, 9]: i) normalization (and quality control) of the raw count matrix ii) gene selection, and cell filtering iii) dimensionality reduction, iv) unsupervised clustering of cells into groups (or clusters) and v) annotation of cells by assigning labels to each cluster. Clustering of cells is not a distinct process in the downstream analysis, instead is a combination of several steps starting from step-(i) to step-(iv). Each step has an immense impact on the cell clustering process. A good clustering (or classifying cell samples) can be ensured by the following characteristics of features obtained from the step-(iii): the features should contain information about the biology of the system, should not have features containing random noise, and should preserve the structure of data while reducing the size as much as possible. Although there are a plethora of methods [1, 10–14] available for performing each task within the pipeline, the standard approaches consider a common sequence of steps for the preprocessing of scRNA-seq data [15, 16]. This includes normalization by scaling of sample-specific size factors, log transformation, and gene selection by using the coefficient of variation (highly variable genes [5, 17]) or by using average expression level (highly expressed genes). Scanpy used several dispersion based methods [1, 18] for selecting highly variable genes (HVG). In Seurat package [19], standardized variance is calculated from the normalized data to find out the HVGs. Alternatively, some methods exist for gene selection, such as GLM-PCA [15] selects genes by ranking genes using deviance, M3Drop [12] selects genes leveraging the dropouts effects in the scRNA-seq data. After gene selection and dimensionality reduction, most of the methods for single cell clustering followed a Louvain/Leiden based graph clustering method (Scanpy and Seurat). The standard approaches for gene selection (or feature extraction) fail to produce a stable and predictive feature set for higher dimension scRNA-seq data [20]. Moreover, the existing approaches overlook the cellular heterogeneity and patterns across transcriptional landscapes, which ultimately affects cell clustering. This motivates to go for a robust and stable technique that can deal with the larger dimension of the single cell data, while preserving the cell-to-cell variability. Here, we introduce sc-CGconv, stepwise feature extraction and clustering approach that leverages the Copula-based dependency measure [13] and its implication for identifying stable features from large scRNA-seq data. Notably, this approach largely and effectively masks all the aforementioned limitations associated with other feature selection and clustering approaches in unsupervised cases. sc-CGconv takes a two-step strategy: first, a structure-aware gene sampling based on Local Sensitive Hashing (LSH) is performed to obtain a sub-optimal gene set [21]. In the next step, a copula-based multivariate dependency measure is utilized to map the cells into a graph which is further learned by a graph convolution network. A robust-equitable copula correlation (Ccor) is utilized for constructing the cell-cell graph. The first step ensures preserving the cell-to-cell dependence structure within the sub-sample of genes, while the second step puts all the cells into an encompassing context retaining the dependency structure among the cells, resulting in a topology-preserving embedding of the fine-grained graph using the GCN. The latent embedding resulting from the trained GCN is utilized for clustering. The advantages of sc-CGconv are: a new robust-equitable copula correlation (Ccor) measure for constructing cell-cell graph leveraging the scale-invariant property of Copula, and reducing the computational cost of processing large datasets due to the use of structure-aware using LSH. Furthermore, to highlight the potency of sc-CGconv over the existing methods, we compared our method with seven well-known gene selection and clustering methods of scRNA-seq data: Gini-clust [22], GLM-PCA [15], M3Drop [12], Fano factor, and HVG followed by clustering of Seurat V3/V4 [23], scGeneFit [24] followed Kmeans clustering, and SC3 clustering [4]. For the methods which are specific for gene selection (GLM-PCA, M3drop) we perform kmeans clustering after selecting the informative genes. We further carry out a stability test to prove the efficacy of sc-CGconv for producing topology preserving embedding of cells from scRNA-seq data in the presence of technical noise. The results show that sc-CGconv not only can extract the most informative and non-redundant features but is also less sensitive towards technical noise present in the scRNA-seq data. We demonstrate in experiments that (i) sc-CGconv leads to a pure clustering of cells in scRNA-seq data, (ii) the annotation of cells is accurate for unknown test samples (iii) the marker genes which are identified in the annotation step have a clear capability to segregate the cell types in the scRNA-seq data, and (iv) sc-CGconv can handle substantially large data with utmost accuracy.

Results

Overview of sc-CGconv

In the following, we present the workflow of the proposed method sc-CGconv.

sc-CGconv: Workflow

See Fig 1 for the workflow of our analysis pipeline. We describe all important steps in the paragraphs of this subsection.

Fig 1

Workflow of the analysis.

A. scRNA-seq count matrix are downloaded and preprocessed using linnorm. B. LSH based sampling is performed on the preprocessed data to obtain a subsample of features. C. A cell neighbourhood graph is constructed using copula correlation. D. A three layer graph convolution neural network is learned with adjacency matrix and node feature matrix as input. It aggregates information over neighbourhoods to update the representation of nodes. The final representation obtained is called graph embedding which is utilized for cell clustering.

Workflow of the analysis.

A. Preprocessing of raw datasets

See Fig 1A. Raw scRNA-seq datasets are obtained from public data sources. The counts matrix , where c is number of cells and g represents number of genes, is normalized using a transformation method called (Linnorm) [25]. We choose cells having more than 1000 genes with non-zero expression values and choose genes having a minimum read count greater than 5 in at least 10% among all the cells. log2 normalization is employed on the transformed matrix by adding one as a pseudo count.

B. Structure-aware feature sampling using LSH

See Fig 1B. Here, LSH is used to partition the data points (genes) of the preprocessed count matrix into different buckets. Locality Sensitive Hashing (LSH) [26-28] operates on a reduced dimension to find approximate nearest neighbors. LSH uses special locality-sensitive hash functions where the chances of similar objects hashing into the same bucket are more than the collision of dissimilar objects, [29]. By virtue of the hash function the relative distance between the data points in the input space is preserved in the output space. A k-nn graph is formed by searching the five nearest neighbours within the bucket for each gene. A sub-sample of genes is obtained by performing a greedy approach for selecting the genes within each bucket. It results in a subset of genes, that are considered as feature set, F = {f : s = 1, ⋯, m} for the cell-cell graph construction in the next stage. Here the aim is to find out an important non-redundant subset of features (genes) while preserving the structure of the data.

C. Constructing cell-cell graph using copula correlation

See Fig 1C. A robust equitable correlation measure Ccor is utilized to measure the dependence between each pair of cells over the sampled transcriptome obtained from the LSH step. For each node (cell) a ranked list of nodes (cells) is generated based on the Ccor scores. We assume a cell pair having a larger Ccor value shares the most similar transcriptomic profile. Next, a k-nearest neighbour graph is constructed based on the ranked list of each node (cell).

D. Learning low dimensional embedding from cell-cell-graph

See Fig 1D. We employ a network embedding strategy (here: Graph Convolution Network [30]), which extracts node features from the constructed cell-cell graph. In detail, GCN has the advantage that it can utilize the power of convolution neural network to encode the relationship between samples. The graph structure (generally represented as adjacency matrix) together with the information encoded in each node is utilized in the NN. We encode the entire graph (adjacency matrix) into a fixed-size, low-dimensional latent space. Thus GCN encoder preserves the properties of all the nodes (cells) relative to their encompasses in the network. The result of this step is a feature matrix where rows refer to nodes (cells) and columns refer to the inferred network features.

Training of graph convolution network on cell-cell graph

To train the GCN model with our datasets, we first randomly split the cell-cell graph into an 8:1:1 ratio of the train, validation, and test sets. The test edges are not included in the training set, however, we keep all the nodes of the graph in the training set. Now, we train the model using the training edges and check the performance of the trained model for recovering the removed test edges. The model is trained with 50 epochs using Adam optimizer with a learning rate of 0.001 and a dropout rate of 0.1. Adam [31] is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iterative based on training data. It is appropriate for problems with very sparse gradients. The rectified linear activation function (ReLU) is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. It has become the default activation function for most of the neural networks models because of its simplicity and better performance. The learning rate is initially chosen as 0.001 as it gives better performance in our experiments. Table 1 shows the average precision and receiver operating characteristic (ROC) score (ROC plot is given in Fig A in S1 File) for the four networks obtained from the datasets. We took the low dimensional embeddings from the output of the encoder of the trained model.

Table 1

Performance of GCN on networks created from four datasets.

First two columns of the table shows total number of edges and number of nodes of the four networks. The rest of the columns show ROC and average precision score for validation and test edges. V. ROC and V. AP refer to validation ROC and validation average precision score, whereas T. ROC and T. AP refer to the same for test set.

Dataset	#edges	#nodes	V. ROC	V. AP	T. ROC	T. AP
Baron [32]	41876	8569	87.32	87.08	85.87	86.39
Klein [33]	13885	2717	84.79	83.21	83.46	82.81
Melanoma [34]	340875	68579	83.38	86.48	83.1	82.30
PBMC68k [1]	342890	68793	84.98	86.78	82.9	83.8

Performance of GCN on networks created from four datasets.

sc-CGconv can produce topology-preserving single-cell embedding

The resulting embedding of sc-CGconv can be utilized to generate the single-cell embeddings for clustering. Here we compare sc-CGconv with three manifold learning and graph drawing algorithms such as UMAP, t-SNE, and ForceAtlas2 to quantify the quality of resulting embedding. To see how similar the topology of low-dimensional embedding (within the latent space Z) is to the topology of the high-dimensional space (X), we adopted a procedure similar to Wolf et al. [35]. Here, we define a classification setup where the ground truth is defined as a kNN graph G = (V, E) fitted in the high dimensional space X. The edge set E which defines all possible edges is the state space of the classification problem. In this setting, the embedding algorithm predicts whether an edge e ∈ E is an element of E. We put label ‘1’ for the edge e ∈ E if e ∈ E, otherwise put label ‘0’. For each edge e ∈ E, the embedding will put label ‘1’ with the probability q and put label ‘0’ with probability 1 − q. The cost function to train such a classifier is formed as a binary cross-entropy function H(P, Q) or logloss, which is equivalent to the negative log-likelihood of the labels under the model. It is defined as Now the Kullback-Leibler (KL) [36] divergence of the predicted distribution Q and the reference distribution P is measured as KL(P, Q) = H(P, Q) − H(P), where , which ultimately leads to the equation We measured the KL divergence between P and Q for t-SNE [37], UMAP [38], ForceAtlas2 [39], PHATE [40], SAUCIE [41], and the sc-CGconv. Fig 2 shows the statistics of KL measures for the different embeddings in the four used datasets.

Fig 2

Performance of different embedding algorithms on four datasets.

Kl divergence (KL div) is computed by rerunning embedding algorithms 50 times.

Performance of different embedding algorithms on four datasets.

Kl divergence (KL div) is computed by rerunning embedding algorithms 50 times.

Comparison with State-Of-The-Art Methods

In scRNA-seq datasets, single cells are the unit of analysis, and it is crucial to correctly identify the clusters to which they belong. These reference clusters are typically based on the expression profiles of many cells. Misclassification of cells is the common issue for annotating clusters as single-cell gene expression datasets often show a high level of heterogeneity even within a given cluster. To establish the efficacy of sc-CGconv over such procedures, we have selected seven state-of-the-art methods that are widely used for gene selection and clustering of the single cell data. Here, we compare sc-CGconv with the following seven methods I) Gini Clust [22]: a feature selection scheme using Gini-index followed by density-based spatial clustering of applications with noise, DBSCAN [42]. II) GLM-PCA [15]: a multinomial model for feature selection and dimensionality reduction using generalized principal component analysis (GLM-PCA) followed by k-means clustering III) Seurat V3/V4 [23]: a single cell clustering pipeline which selects Highly Variable Gene (HVG) that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others) followed by Louvain clustering IV) Fano Factor [43], a measure of dispersion among features. Features having the maximum Fano Factor are chosen for Kmeans clustering. V) scGeneFit [24], a marker selection method that jointly optimizes cell label recovery using label-aware compressive classification, resulting in a substantially more robust and less redundant set of markers. Vi) SC3 [4], a single-cell consensus (Kmeans) clustering (SC3) tool for unsupervised clustering of scRNA-seq data. Vii) M3drop [44] which takes advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. The R package for Gini-Clust [22] is employed with default settings. For GLM-PCA and HVG, we consider the default settings as described in [15, 23]. For scGene-Fit, the parameters (redundancy = 0.25, and method = ’centers’) are used as suggested by their github page. For SC3 we adopted the default parameters for clustering all the datasets. sc-CGconv requires the number of iterations (iter) as the input parameters of LSH step. we set as iter as 1 for all datasets. The parameter of Clayton Copula (see section Method for details) is set as θ = −0.5. This θ is set as it gives better performance in our experiments. For GCN, we used 3-layer GCN architecture which performs three propagation steps during the forward pass and convolves 3rd order neighborhood of every node. We take the dimension of the output layer of the first and second layers as 256 and 128. For the decoder, we use a simple inner product decoding scheme. All experiments were carried out on Linux server having 50 cores and x86_64 platform. To validate the clustering results, we utilized two performance metrics, Adjusted Rand Index (ARI) and Silhouette Score [45]. Table 2 depicts the efficacy of sc-CGconv over the other methods. For the other competing methods, we select the top 1000 features/genes in the gene selection step and use the default clustering technique meant for this method. For sc-CGconv, after obtaining the low dimensional embedding of 128-dimension, we performed a simple k-means for clustering the cells. It is evident from the Table that sc-CGconv provides higher ARI (and average silhouette width) values for all four datasets. To know the effectiveness of the feature extraction method within the sc-GCconv, we have replaced it with PCA while keeping the other procedure intact. The last column of Table 2 shows the results of sc-Gcconv with PCA as feature extractor.

Table 2

Comparison with state-of-the-art: Adjusted Rand Index (ARI) and Average Silhouette Width (ASW) are reported for seven competing methods on four datasets.

Dataset	Method
	sc-CGconv		Gini Clust		GLM-PCA+Kmeans		Fano+Kmeans		Seurat
	ARI	ASW	ARI	ASW	ARI	ASW	ARI	ASW	ARI	ASW
Baron [32]	0.68	0.52	0.6	0.48	0.42	0.4	0.52	0.46	0.62	0.47
Melanoma [34]	0.43	0.45	0.56	0.52	0.15	0.29	0.18	0.24	0.42	0.29
Klein [33]	0.86	0.8	0.76	0.7	0.43	0.58	0.4	0.3	0.8	0.72
PBMC [1]	0.50	0.3	0.51	0.46	0.38	0.29	0.31	0.26	0.29	0.14
	scGeneFit+Kmeans		SC3		M3drop		sc-GCconv (PCA)
	ARI	ASW	ARI	ASW	ARI	ASW	ARI	ASW
Baron [32]	0.62	0.43	0.60	0.4	0.54	0.48	0.60	0.49
Melanoma [34]	0.25	0.4	0.38	0.35	0.33	0.26	0.38	0.34
Klein [33]	0.82	0.75	0.80	0.66	0.67	0.54	0.71	0.76
PBMC [1]	0.47	0.48	0.48	0.31	0.35	0.3	0.41	0.30

sc-CGconv preserves cell-to-cell variability

Once features or low dimensional embedding are estimated to be important, it is essential to ask whether the cell-to-cell variability has been preserved within this low dimensional space. To determine this, we computed the Euclidean distance between each pair of cells, both in original dimension and in low dimension space. Thus two Euclidean distance matrices are obtained, one for the original feature space, and the other for the reduced dimension. The Correlation score (Kendall τ) is computed between these two distance matrices. Fig 3 depicts the correlation measures for all the competitive methods in the four scRNA-seq datasets.

Fig 3

Correlation score between two distance matrices, defined on original and reduced dimension.

Figure shows the comparisons among the competing methods based on the correlation scores (Kendall τ) obtained from four different scRNA-seq datasets.

Correlation score between two distance matrices, defined on original and reduced dimension.

Figure shows the comparisons among the competing methods based on the correlation scores (Kendall τ) obtained from four different scRNA-seq datasets.

sc-CGconv can identify marker genes

We followed the conventional procedure of Scanpy to find out markers (DE genes) from the clustering results. The clustering is performed on the low dimensional embedding obtained from the sc-GCconv. Scanpy utilized the Leiden graph clustering method to group the cells into different clusters. Wilcoxon rank-sum test [46] is utilized to find out the significant (p < 0.05) DE genes for each cluster which are treated as marker genes. We took the top 50 marker genes with their p-value threshold of 0.05 on the PBMC 68k dataset. We found that 19 marker genes from the Melanoma dataset, 12 marker genes from the PBMC dataset, 8 marker genes from the Baron dataset, and 4 marker genes from the Klein dataset, are biologically significant according to CellMarker database [47]. The list of biologically significant marker genes for all four datasets is given in Table A in S1 File. The results of marker gene analysis of the Melanoma data can be found in S2 Text, Table A, and Figs B and C in S1 File.

Execution time

All experiments were carried out on a Linux server having 50 cores and X86_64 platform. As our proposed method is a deep learning based feature extraction method, it takes more time than any filter-based feature selection technique (e.g. Fano, Gini − clust). To check how the competing methods scale with the number of cells (and classes) we performed an analysis. We have generated simulated data (using splatter) by varying the number of cells (and classes). Four simulated data are generated with the number of cells and classes as follows: 500 cells with two classes, 1000 cells with three classes, 1500 cells with four classes, and 2000 cells with five classes. All data are generated with equal group probabilities, 2000 number of features, fixed dropout rate (0.2), and 40% DE gene proportion. 1000 features are selected in each compared method and 128 embedded features are chosen using sc-CGconv. The runtime is compared with all seven different competing methods. The execution time (minute) for each dataset is given in Table 3.

Table 3

Execution time in minute for eight competing methods.

Datasets	# Cells	# Class	Execution Time (in Minute)
			sc-CGconv	Gini Clust	GLM-PCA	Fano	Seurat	scGeneFit	SC3	M3drop
Data1	500	2	9	2	1	1	3	3	5	4
Data2	1000	3	13	2	1	1	7	5	8	6
Data3	1500	4	17	3	1	3	11	10	13	12
Data4	2000	5	20	5	3	5	14	13	17	15

Conclusions

In this paper, we have developed a step-wise sampling based feature extraction method for scRNA-seq data leveraging the Copula dependency measure with graph convolution network. On one hand, LSH based sampling is used to deal with ultra-large sample size, whereas Copula dependency is utilized to model the interdependence between features (genes) to construct the cell-cell graph. Graph convolution network has been utilized to learn low dimensional embedding of the constructed graph. There are four striking characteristics of the proposed method: I) It can sample a subset of features from original data keeping the structure intact. Therefore, minor clusters are not ignored. This sampling is achieved by using the LSH based sampling method. II) sc-CGconv utilizes scale invariant dependency measure which gives a superior and stable measure for constructing the dependency graph among the cells. III) GCN provides topology-preserving low dimensional embedding of the cell graphs. It can effectively capture higher-order relations among cells. IV) LSH based structure-aware sampling of features showed a significant lift in the accuracy (Correlation, ARI values) in large single cell RNA-seq datasets. Another important observation is that sc-CGconv yields the highest ARI values for Klein and Pollen datasets in comparison to other State-Of-The-Art methods. The rationale behind this is that sc-CGconv utilized copula correlation measure, which correctly models the correlations among the feature set. In the holistic viewpoint, the sc-CGconv algorithm performs much better than the other methods. The computation time of sc-CGconv is equivalent to the number of sampled features. The process may be computationally expensive when a large number of features are selected in the LSH step. However, as copula correlation returns stable and non-redundant features, in reality, a small set of selected features will be effective to construct the cell-cell graph. We observed in scRNA-seq data 1000 sampled features will serve the purpose. Taken together, the proposed method sc-CGconv not only outperforms in topology preserving generation of cell embedding but also can able to identify good clusters for large single cell data. It can be observed from the results that sc-CGconv leads both in the domain of single cell clustering by extracting informative features and generating low dimensional embedding of cells from large scRNA-seq data. The results prove that sc-CGconv may be treated as an important tool for computational biologists to investigate the primary steps of downstream analysis of scRNA-seq data.

Method

Overview of datasets

We used four public single-cell RNA sequence datasets downloaded from Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo/ and 10X genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets). Table 4 shows a summary of the used datasets. See S1 Text in S1 File for a detailed description of the datasets.

Table 4

A brief summary of the dataset used here.

Dataset	Dataset Descrition	#Features	#Instances	#Class
Baron [32]	Human pancreas cell	20125	8569	8
Klein [33]	Mouse Embryo Cell	24175	2717	4
Melanoma [34]	Human Tumor Cell	19783	68579	14
PBMC68k [1]	Human Blood tissue	32738	68793	11

The formal details of sc-CGconv

Copula

The term Copula [48] originated from a Latin word Copulare, which means ‘join together’. The Copula is utilized in several domains in statistics to obtain joint distributions from uniform marginal distributions. Following the famous Sklar’s theorem, Copula (C) function is defined as follows [49, 50] Let, (U1, U2, ⋯, U) be the random variables whose marginal distributions are uniform over [0, 1]. A copula function C : [0, 1] → [0, 1] is defined as the joint distribution: Sklar’s theorem extends this definition to more general random variables with possibly non-uniform marginals. The theorem states that, for any set of n random variables (X1, …, X), their joint cumulative distribution function H(x1, ⋯, x) = P [X1 ≤ x1 … X ≤ x] can be described in terms of its marginals F(x) = P [X ≤ x] and a Copula function C, formally stated as: Among several categories of Copulas [51], Clayton Copula from Archimedean family is one of the most widely used function for high dimensional datasets [52].

Clayton Copula

Let, ϕ be a strictly decreasing function such that ϕ(1) = 0, and ϕ[−1](x) is the pseudo inverse of ϕ(t) such that ϕ[−1](x) = ϕ−1(x) for x ∈ [0, ϕ(0)) and ϕ[−1](t) = 0 for x ≥ ϕ(0). Let U1, U2, …, U be the random variables having uniform marginal distributions. Then, the general family of Archimedean copula is described as, where, ϕ(.) is called the generator function. The Clayton Copula is a particular Archimedian copula when the generator function ϕ is given by, with θ ∈ [−1, ∞)0).

Copula based correlation measure (Ccor)

We model the dependence between two random variables using Kendall tau(τ) [53] measure. Note that we defined Kendall tau correlation using copula based dependence measure. In [54], Ding et al. first proposed a way to define the correlation by using copula dependence measure. They proposed Ccor as a robust and equitable measure which is defined as half the L1-distance of the copula density function from independence. In [13, 55]Ccor was defined in terms of Kendall’s tau (τ) measure. Here we retain the definition of Ccor proposed in [13]. Kendall’s tau (τ) is the measure of concordance between two variables; defined as the probability of concordance minus the probability of discordance. Formally this can be expressed as The concordance function (Q) is the difference of the probabilities between concordance and discordance between two vectors (x1, y1) and (x2, y2) of continuous random variables with different joint distribution H1 and H2 and common margins F and F. It can be proved that the function Q depends on the distribution of (x1, y1) and (x2, y2) only through their copulas. According to Nelson [48], there is a relation between Copula and Kendall τ that can be expressed as: Where, u ∈ F(x) and v ∈ F(y). τ(C) is termed as Ccor in our study. Here the copula density C(u, v) is estimated through the clayton copula defined in the previous section. We have used τ to model the dependency between transcriptomic profiles among the cells.

Feature extraction using sc-CGconv

sc-CGconv takes a stepwise approach for feature extraction from the scRNA-seq data: first, it obtains a sub-sample of genes using locality sensitive hashing, next it generates a cell neighborhood graph by utilizing the copula correlation (Ccor) measure, and finally, a graph representation learning algorithm (here GCN) is utilized to get the low dimensional embedding of the constructed graph.

Structure preserving feature sampling using LSH

LSH [28] reduces the dimensionality of higher dimension datasets using an approximate nearest neighbor approach. LSH uses a random hyperplane based hash function, which maps similar objects into the same bucket. LSH is used to partition the data points (genes) of the preprocessed count matrix (M′) into k (here k = 10) different buckets such that |G′| > 2, where is the set of genes in M′. A k-nn graph is formed by searching the five nearest neighbours within the bucket for each gene. Each gene is visited sequentially in the same order as it appears in the original dataset and is added to the selected list while discarding its nearest neighbours. If the visited gene is discarded previously, then it will be skipped and its neighbors will be discarded. Thus a sub-sample of genes is obtained, which is further down-sampled by performing the same procedure recursively. The number of iterations for downsampling is user defined and generally depends on the size of the data points. We use cosine distance to compute the nearest neighbours of a gene. LSHForest [56] python package is utilized to implement the whole process. Thus, a subset of m number of genes, where,(m < ge′) are obtained from the above sampling stage. These genes are considered as feature set, F = {f : s = 1, ⋯, m} for the next stage of cell-cell graph construction.

Cell neighbourhood graph construction

For graph construction, we rank each node (cell) according to the Ccor values. For a node (cell) we compute the Ccor values to all of its possible pairs. A k-nearest neighbour list is prepared for each node based on the Ccor values. A high value of Ccor assumes there is a high similarity between the cell pair over the transcriptomic profile, while a smaller value signifies low similarity. The output of this step is an adjacency matrix representing the connection among the cells according to the k-nearest neighbour list and a node feature matrix storing the Ccor values for each node pair.

Extracting node features using GCN

We have utilized graph convolution network (GCN) [30] to learn the low dimensional embedding of nodes from the cell-cell graph. Given a graph G = (V, E), the goal is to learn a function of signals/features on G which takes i) A optional feature matrix X ∈ N × D, where x is a feature description for every node i, N is the number of nodes and D is the number of input features and ii) A description of the graph structure in matrix form which typically represents adjacency matrix A as inputs, and produces a node-level output Z ∈ N × F, where F represents the output dimension of each node feature. The graph-level outputs are modeled by using indexing operation, analogous to the pooling operation uses in standard convolutional neural networks [57]. In general every layer of neural network can be described as a non-linear function: where H(0) = X and H( = Z, l representing the number of layers, f(., .) is a non linear activation function like ReLU. Following the definition of layer-wise propagation rule proposed in [30] the function can be written as where , I represents identity matrix, is the diagonal node degree matrix of , , W represents trainable weight matrix of the neural network. Intuitively, the graph convolution operator calculates the new feature of a node by computing the weighted average of the node attribute of the node itself and its neighbours. The operation ensures identical embedding of two nodes if the nodes have identical neighboring structures and node features. We adopted the GCN architecture similar to [30], a 3-layer GCN architecture with randomly initialized weights. For the cell-cell graph, we take the adjacency matrix (A) of the neighbourhood graph and put identity matrix (I) as the node feature matrix. The 3-layer GCN performs three propagation steps during the forward pass and effectively convolves the 3rd-order neighborhood of every node. The encoder uses a graph convolution neural network (GCN) on the entire graph to create the latent representation The encoder works on the full adjacency matrix A ∈ R and X ∈ R is the node feature matrix, obtained from LSH step. Here we used a simple inner product decoder that try to reconstruct the main adjacency matrix A. The decoder function follows the following equation: where z, z reflect the representations of nodes i, j, as computed by the encoder. The trained model is applied to the test edges to see how effectively it can discover the deleted edges (see ‘Training of GCN on cell-cell graph’ in Result section). After training and evaluation of the model, the low dimensional embedding is kept and used in the cell clustering task.

Supporting information.

S1 Text: Overview of the datasets. S2 Text: Marker analysis with sc-CGconv. Table A: Marker genes identified from the clustering results with sc-CGconv. Fig A: Performance of GCN on networks created from four datasets: receiver operating characteristic (ROC) curve for the validation is given for four datasets (see Table 1 of the main text for ROC score). Fig B: Marker analysis using sc-CGconv. After clustering DE genes are identified using clustering of PBMC data with sc-CGconv and results of marker genes on ultra large PBMC datasets. Panel-A. 2D UMAP visualization of PBMC dataset with original cell annotations. Panel-B. 2D UMAP visualization of clustering results with sc-CGconv. Panel-C. visualization of 12 markers which are overlayed based on their expression –low (blue) to high (yellow)- on the reference PBMC UMAP plot. Fig C: Figure depicts the results of marker gene analysis on melanoma datasets. Panel-A. 2D UMAP visualization of melanoma data with original annotation. Panel B. 2D UMAP visualization of clustering results with sc-CGconv. Panel-C. visualization of 9 markers which are overlayed based on their expression –low (blue) to high (yellow). (PDF) Click here for additional data file. 11 Dec 2021 Dear Dr. Ray, Thank you very much for submitting your manuscript "A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data" for consideration at PLOS Computational Biology. Reviewers have now commented on your paper. You will see that they are advising that you revise your manuscript. If you are prepared to undertake the work required, I would be pleased to reconsider my decision. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Quan Zou Guest Editor PLOS Computational Biology Ilya Ioshikhes Deputy Editor PLOS Computational Biology *********************** Reviewers have now commented on your paper. You will see that they are advising that you revise your manuscript. If you are prepared to undertake the work required, I would be pleased to reconsider my decision. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In this study, Lall et al introduce sc-CGconv (Copula based graph convolution network for single cell clustering). This is an unsupervised feature extraction and clustering approach that uses copula correlation (Ccor) followed by graph convolution based clustering approach r to extract biologically relevant gene clusters obtained from single cell RNASeq data. The authors compare their method with 5 other popular methods of gene selection and clustering and discuss the advantages of their method over other popular methods for scRNA seq cell clustering. Overall, the manuscript is well written and provides a new method for understanding biologically meaningful signals arising out of scRNA seq data. The manuscript could be considered fit for publication once the following issues are addressed: 1. Line 35: “The standard approaches for gene selection (or feature extraction) are failed..” Consider rephrasing to: “The standard approaches for gene selection (or feature selection) fail to…” 2. Line 37: “Moreover, the exiting approaches…” Typo here. “Moreover, existing approaches...” 3. Line 56: “The advantages of sc-CGconv come selecting a new robust equitable …” Consider substituting ‘come’ or use another appropriate phrase.” 4. Figure 1: Capitalize ‘W’ in figure legend. Also a typo is observed: “adjacencey matrix”. 5. Line 70: Why is ‘experiments’ italicized? 6. Line 71: “(ii) the annotation of cells are accurate ...” Replace “are” with “is” 7. Lines 81 and 96: It is seen that 81 and 96 state “See -A of figure 1” and “See C in Figure 1” whereas Lines 88 and 103 state “See panel-B” and “See panel-D”. Please consider mentioning either panel-A or Figure 1A,1B (whichever is appropriate). 8. Line 84: Consider rephrasing. “… a thousand genes expression...” This sentence could be written as “We choose cells having more than 1000 genes with non-zero expression values…” Also, the authors could provide an explanation of how the structure of the data matrix is preserved. How does LSH achieve that? 9. Line 96: “A robust equitable correlation measure Ccor”. The manuscript would greatly benefit if the authors provided an overview of the copula correlation concept to a greater extent. How does this method compare to other correlation metrics? A brief literature survey about the usage of this concept is also warranted to drive home the usage of this correlation measure. Which other fields use copula-based correlation and to what extent are they helpful? 10. Line 111: “The result of this step is a feature matrices (Fi)..” Grammatical error – “… is a feature matrix (Fi)..” 11. Line 112: Which network features are inferred here? Unclear. 12. Line 120: Consider providing an explanation behind the use of Adam optimizer and ReLu as activation function. A comment on the learning rate for the optimization procedure would be helpful. 13. Line 122: Manuscript could benefit by the addition of a figure for the ROC curve, in addition to the ROC statistics in table1. 14. Table1: Table legend states –“First two columns of the table shows total number of nodes and number of edges..” whereas in the table itself number of edges is written first followed by number of nodes. 15. Line 125: Wolf et al should be capitalized. Manuscript states “…wolf et al”. 16. Line 126: Consider writing the full name of Kullback-Leibler (KL) divergence since this is the first time 17. Line 129: “We measured the KL divergenece”. Typo here. 18. Line 132: The correct usage of the phrase is “state of the art” and not “arts”. Please rectify. Also in Table 2 (Table caption). 19. Figure 2: ‘KL div’ as a short form for KL divergence should be mentioned in the legend. 20. Line 157: What is the mathematical interpretation of the Clayton Copula? How does the value of theta matter? What is its physical implication towards the performance of scCGconv? 21. Line 174: “Once features / (low dimensional embedding) …” Please consider replacing / with or and remove parentheses. It is confusing. 22. Line 179 and Figure 3 (Legend): What do the authors mean by “..(Kandle tau “ in line 179 and “Kandel Tau in Figure 3 legend..”? Do the authors intend to say Kendall tau? Please rectify if this is the case. 23. Line 180: grammatical error - “…for all competitive method…” should be - “…for all competitive methods…” 24. Line 182: This section should be considered for rewriting. How exactly are the authors identifying marker genes using their method? A detailed explanation is warranted here. 25. Also line 187 states 19 markers from melanoma and 13 from PBMC dataset are significant; whereas table 3 states 19 and 12. Which numbers are correct? 26. Line 188: If cell marker is the name of the database, please capitalize the name as in Cell Marker? 27. Table 3: What do the authors mean by pubmed id of the markers? The numbers in parenthesis do not refer to the NCBI gene id or Ensembl Gene Stable ID. Please provide an explanation of what nomenclature of genes is used here. Also why do all 3 CD8 T cell markers have the same ID i.e. 28622514 although they correspond to 3 different genes? Are these IDs specific to the Cell Marker database used by the authors? 28. Line 189: What are the markers identified by scCGconv for other 2 datasets (Baron and Klein)? It would be beneficial to have a list of these markers as a Supplementary Table. 29. Line 209: What is meant by Human Klein and Pollen? Please explain. 30. In general, how does this method compare to other NN utilizing scRNAseq methods like PHATE and SAUCIE? Please comment on this. 31. It would be good to see how the list of features selected by this method compares to lists of genes selected by other methods (for example methods that use standardized variance / Fano or Gini index) 32. It would also be useful to see an exploration of the properties of genes. For example, Gini picks up genes expressed in few cells, Fano picks up genes with large variance with respect to the mean expression. What are the variational properties of the genes picked up by their method? 33. The authors don't do any intermediate dimensionality reduction (typically one would do feature selection, run PCA, and use the PCs for further computation rather than genes). The construction of the cell kNN graph is done directly in the selected feature space. This could be justified since their gene selection is supposed to remove redundant information, so running PCA might not add much. But this is another place where a formal comparison of the two approaches would be worthwhile to see. PCA and their gene selection method are both essentially trying to remove redundant information; so if one compares their approach with one where there is no feature selection and only PCA (and using PCs for further steps), what does the comparison look like? One example comparison: are the genes that contribute heavily to top PCs in PCA the same (or similar) as the genes retained by their feature selection method? Reviewer #2: In this manuscript, Lall et al. proposed a novel feature selection and clustering approach named sc-CGconv for single-cell RNA-seq data. sc-CGconv first performs LSH based sampling to obtain structure-aware features and then constructs cell-cell graph using copula correlation. Finally, low dimensional embedding is extracted from graph convolution network. Overall, Sc-CGconv shows great performance in comparison with state-of-the-art clustering methods, it may be useful for researcher in this field. I would like to see the authors address the following comments. 1. M3Drop is an important and popular feature selection method tailed for scRNA-seq data, the authors also mentioned it for several times in the manuscript. However, there is no comparison with it in Table 2. I would suggest the authors to add it to get a more comprehensive comparison. 2. As we know, deep learning is a powerful tool, but it often tends to consume a lot of time. To let users fully understand the characteristics of the proposed approach, the authors should provide the running time of all the methods used in this study for all four datasets. 3. After I successfully installed the R package “copulaGCN” following the authors’ instructions at Github (https://github.com/Snehalikalall/CopulaGCN), running “library(copulagcn)” still got an error. The correct name of the package may be “ucfs”? Please check the package name and modify the correct usage in the README. 4. I believe all feature selection methods could easily identify several marker genes, therefore it is not necessary to highlight Table 3. I would urge the authors to put Table 3 in the supplementary files. 5. In the legend A of Fig1, method “limnorm” does not exist. The authors may refer to “linnorm”. 6. Several important reviews concerning feature selection and clustering of scRNA-seq data are missing, such as: 1) Zhang Z, Cui F, Wang C, Zhao L, Zou Q: Goals and approaches for each processing step for single-cell RNA sequencing data. Briefings in bioinformatics 2021, 22(4). 2) Zhu X, Li H-D, Guo L, Wu F-X, Wang J: Analysis of Single-Cell RNA-seq Data by Clustering Approaches. Current Bioinformatics 2019, 14(4):314-322. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols 19 Jan 2022 Submitted filename: Review of GCN_Plos.pdf Click here for additional data file. 27 Jan 2022 Dear Dr. Ray, We are pleased to inform you that your manuscript 'A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Quan Zou Guest Editor PLOS Computational Biology Ilya Ioshikhes Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The questions and concerns raised in the review process were satisfactorily addressed. The research was performed with rigor and the results presented were easy to interpret. I recommend that the manuscript should be accepted for publication. Reviewer #2: I am satisfied with the answers the authors provided to my questions and the revisions they added to the manuscript and supplementary information. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No 7 Mar 2022 PCOMPBIOL-D-21-01961R1 A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data Dear Dr Ray, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Anita Estes PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

33 in total

1. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics.

Authors: Mireya Plass; Jordi Solana; F Alexander Wolf; Salah Ayoub; Aristotelis Misios; Petar Glažar; Benedikt Obermayer; Fabian J Theis; Christine Kocks; Nikolaus Rajewsky
Journal: Science Date: 2018-04-19 Impact factor: 47.728

2. Cell type transcriptome atlas for the planarian Schmidtea mediterranea.

Authors: Christopher T Fincher; Omri Wurtzel; Thom de Hoog; Kellie M Kravarik; Peter W Reddien
Journal: Science Date: 2018-04-19 Impact factor: 47.728

3. Exploring single-cell data with deep multitasking neural networks.

Authors: Matthew Amodio; David van Dijk; Krishnan Srinivasan; Guy Wolf; Smita Krishnaswamy; William S Chen; Hussein Mohsen; Kevin R Moon; Allison Campbell; Yujiao Zhao; Xiaomei Wang; Manjunatha Venkataswamy; Anita Desai; V Ravi; Priti Kumar; Ruth Montgomery
Journal: Nat Methods Date: 2019-10-07 Impact factor: 28.547

4. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure.

Authors: Maayan Baron; Adrian Veres; Samuel L Wolock; Aubrey L Faust; Renaud Gaujoux; Amedeo Vetere; Jennifer Hyoje Ryu; Bridget K Wagner; Shai S Shen-Orr; Allon M Klein; Douglas A Melton; Itai Yanai
Journal: Cell Syst Date: 2016-09-22 Impact factor: 10.304

5. Linnorm: improved statistical analysis for single cell RNA-seq expression data.

Authors: Shun H Yip; Panwen Wang; Jean-Pierre A Kocher; Pak Chung Sham; Junwen Wang
Journal: Nucleic Acids Res Date: 2017-12-15 Impact factor: 16.971

6. SC3: consensus clustering of single-cell RNA-seq data.

Authors: Vladimir Yu Kiselev; Kristina Kirschner; Michael T Schaub; Tallulah Andrews; Andrew Yiu; Tamir Chandra; Kedar N Natarajan; Wolf Reik; Mauricio Barahona; Anthony R Green; Martin Hemberg
Journal: Nat Methods Date: 2017-03-27 Impact factor: 28.547

7. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model.

Authors: F William Townes; Stephanie C Hicks; Martin J Aryee; Rafael A Irizarry
Journal: Genome Biol Date: 2019-12-23 Impact factor: 13.583

8. RgCop-A regularized copula based method for gene selection in single-cell RNA-seq data.

Authors: Snehalika Lall; Sumanta Ray; Sanghamitra Bandyopadhyay
Journal: PLoS Comput Biol Date: 2021-10-19 Impact factor: 4.475

9. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.

Authors: Lan Jiang; Huidong Chen; Luca Pinello; Guo-Cheng Yuan
Journal: Genome Biol Date: 2016-07-01 Impact factor: 13.583

10. CellMarker: a manually curated resource of cell markers in human and mouse.

Authors: Xinxin Zhang; Yujia Lan; Jinyuan Xu; Fei Quan; Erjie Zhao; Chunyu Deng; Tao Luo; Liwen Xu; Gaoming Liao; Min Yan; Yanyan Ping; Feng Li; Aiai Shi; Jing Bai; Tingting Zhao; Xia Li; Yun Xiao
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971