Snehalika Lall1, Sumanta Ray2,3, Sanghamitra Bandyopadhyay1.
Abstract
Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Several issues in single-cell sequencing affect the homogeneous grouping (clustering) of cells, such as the small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell cycle, cellular morphology, and variable reagent concentrations. Moreover, single-cell data are susceptible to technical noise, which affects the quality of the genes (or features) selected or extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single-cell clustering), a stepwise, robust, unsupervised feature extraction and clustering approach that formulates and aggregates cell-cell relationships using copula correlation (Ccor), followed by graph convolution network based clustering. sc-CGconv formulates a cell-cell graph using Ccor, which is then learned by a graph-based artificial intelligence model, a graph convolution network. The learned representation (low dimensional embedding) is utilized for cell clustering. sc-CGconv features the following advantages:
a. sc-CGconv works with substantially smaller sample sizes to identify homogeneous clusters.
b. sc-CGconv can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering.
c. sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure.
d. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space.
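The graph-construction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' Ccor measure, whose definition is not given here. As a stand-in it uses Spearman's rank correlation, which depends only on the copula of two variables. The helper names (`ranks`, `spearman`, `build_cell_graph`), the toy expression values, and the threshold of 0.8 are all assumptions made for illustration.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Rank correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

def build_cell_graph(cells, threshold=0.8):
    """Connect two cells when their rank correlation reaches the threshold."""
    edges = []
    for i, j in combinations(range(len(cells)), 2):
        if spearman(cells[i], cells[j]) >= threshold:
            edges.append((i, j))
    return edges

# Hypothetical expression profiles: one row per cell, one column per gene.
cells = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 7.0, 9.0],  # same gene ordering as cell 0
    [9.0, 5.0, 2.0, 1.0],  # reversed gene ordering
]
edges = build_cell_graph(cells)
print(edges)
```

Because only ranks enter the score, cells 0 and 1 are linked even though their raw expression scales differ, which is the property a copula-based measure is meant to capture.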
Year: 2022 PMID: 35271564 PMCID: PMC8979455 DOI: 10.1371/journal.pcbi.1009600
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Workflow of the analysis.
A. scRNA-seq count matrices are downloaded and preprocessed using Linnorm. B. LSH based sampling is performed on the preprocessed data to obtain a subsample of features. C. A cell neighbourhood graph is constructed using copula correlation. D. A three-layer graph convolution neural network is trained with the adjacency matrix and the node feature matrix as input. It aggregates information over neighbourhoods to update the node representations. The final representation, called the graph embedding, is used for cell clustering.
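Step D can be sketched with the widely used graph convolution propagation rule, H' = ReLU(Â H W), where Â = D^{-1/2}(A + I)D^{-1/2}. This is an assumption about the layer form, not a statement of the authors' exact architecture, and the sketch performs only a forward pass with fixed toy weights (no training). All function names and values below are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(a_norm, features, weights, activation=True):
    """One layer: aggregate neighbour features, project, optionally ReLU."""
    out = matmul(matmul(a_norm, features), weights)
    if activation:
        out = [[max(0.0, v) for v in row] for row in out]
    return out

def gcn_forward(adj, features, weight_list):
    """Stack layers; the last layer is left linear to act as the embedding."""
    a_norm = normalize_adjacency(adj)
    h = features
    for idx, w in enumerate(weight_list):
        is_last = idx == len(weight_list) - 1
        h = gcn_layer(a_norm, h, w, activation=not is_last)
    return h

# Toy graph: two cells joined by one edge, one feature per cell.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight_list = [[[1.0]], [[1.0]], [[1.0]]]  # three layers, as in the caption
embedding = gcn_forward(adj, features, weight_list)
print(embedding)
```

With identity-like weights, each layer simply averages a cell with its neighbours, so both cells end at the mean feature value. A trained network would learn the weight matrices instead.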
Performance of GCN on networks created from four datasets.
The first two columns of the table show the total number of edges and the number of nodes of the four networks. The remaining columns show the ROC and average precision scores for validation and test edges. V. ROC and V. AP refer to the validation ROC and validation average precision scores, whereas T. ROC and T. AP refer to the same scores for the test set.
| Dataset | #edges | #nodes | V. ROC | V. AP | T. ROC | T. AP |
|---|---|---|---|---|---|---|
| Baron | 41876 | 8569 | 87.32 | 87.08 | 85.87 | 86.39 |
| Klein | 13885 | 2717 | 84.79 | 83.21 | 83.46 | 82.81 |
| Melanoma | 340875 | 68579 | 83.38 | 86.48 | 83.1 | 82.30 |
| PBMC68k | 342890 | 68793 | 84.98 | 86.78 | 82.9 | 83.8 |
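The two edge-prediction metrics in the table can be sketched as follows. The sketch assumes that held-out edges are labelled 1, sampled non-edges are labelled 0, and the model gives each candidate edge a score. The labels and scores below are invented for illustration and do not come from the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive.

    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical candidate edges: 1 = true held-out edge, 0 = sampled non-edge.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(labels, scores), average_precision(labels, scores))
```

Both functions return values in [0, 1]. The table entries appear to be the same quantities expressed as percentages.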
Fig 2. Performance of different embedding algorithms on four datasets.
KL divergence (KL div) is computed by rerunning the embedding algorithms 50 times.
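For reference, the discrete KL divergence is KL(P‖Q) = Σ p_i log(p_i / q_i). The sketch below shows that formula and an average over repeated runs. The caption does not say how the two distributions are derived from the original and embedded data, so the toy distributions, the function name, and the `eps` floor are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two non-negative weight vectors.

    Both vectors are normalised to sum to 1, and q is floored at eps to
    avoid division by zero.
    """
    sp, sq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sp, qi / sq
        if pi > 0:
            total += pi * math.log(pi / max(qi, eps))
    return total

# Hypothetical reference distribution and three repeated embedding runs.
reference = [0.5, 0.5]
runs = [[0.5, 0.5], [0.25, 0.75], [0.4, 0.6]]
mean_kl = sum(kl_divergence(reference, q) for q in runs) / len(runs)
print(mean_kl)
```

A lower value means the embedded distribution stays closer to the reference, and averaging over reruns smooths out the randomness of stochastic embedding algorithms.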
Comparison with state-of-the-art: Adjusted Rand Index (ARI) and Average Silhouette Width (ASW) are reported for seven competing methods on four datasets.
| Dataset | sc-CGconv ARI | sc-CGconv ASW | Gini Clust ARI | Gini Clust ASW | GLM-PCA+Kmeans ARI | GLM-PCA+Kmeans ASW | Fano+Kmeans ARI | Fano+Kmeans ASW | Seurat ARI | Seurat ASW |
|---|---|---|---|---|---|---|---|---|---|---|
| Baron | 0.68 | 0.52 | 0.6 | 0.48 | 0.42 | 0.4 | 0.52 | 0.46 | 0.62 | 0.47 |
| Melanoma | 0.43 | 0.45 | 0.56 | 0.52 | 0.15 | 0.29 | 0.18 | 0.24 | 0.42 | 0.29 |
| Klein | 0.86 | 0.8 | 0.76 | 0.7 | 0.43 | 0.58 | 0.4 | 0.3 | 0.8 | 0.72 |
| PBMC | 0.50 | 0.3 | 0.51 | 0.46 | 0.38 | 0.29 | 0.31 | 0.26 | 0.29 | 0.14 |

| Dataset | scGeneFit+Kmeans ARI | scGeneFit+Kmeans ASW | SC3 ARI | SC3 ASW | M3drop ARI | M3drop ASW | sc-CGconv (PCA) ARI | sc-CGconv (PCA) ASW |
|---|---|---|---|---|---|---|---|---|
| Baron | 0.62 | 0.43 | 0.60 | 0.4 | 0.54 | 0.48 | 0.60 | 0.49 |
| Melanoma | 0.25 | 0.4 | 0.38 | 0.35 | 0.33 | 0.26 | 0.38 | 0.34 |
| Klein | 0.82 | 0.75 | 0.80 | 0.66 | 0.67 | 0.54 | 0.71 | 0.76 |
| PBMC | 0.47 | 0.48 | 0.48 | 0.31 | 0.35 | 0.3 | 0.41 | 0.30 |
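The two scores reported above can be sketched in plain Python. ARI compares predicted clusters with known cell-type labels, while ASW needs only the embedding and the predicted clusters. Euclidean distance and the toy points and labels are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import comb, dist

def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over points; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D embedding with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
true_labels = [0, 0, 1, 1]
pred_labels = [1, 1, 0, 0]  # same partition, different cluster names
ari = adjusted_rand_index(true_labels, pred_labels)
asw = average_silhouette_width(points, pred_labels)
print(ari, asw)
```

ARI ignores how clusters are named, so the relabelled partition still scores 1. Both measures are at most 1, with higher values indicating better clustering.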
Fig 3. Correlation score between two distance matrices, defined on the original and reduced dimensions.
The figure compares the competing methods by the correlation scores (Kendall τ) obtained on four different scRNA-seq datasets.
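One way to compute such a score is to take the pairwise distances between cells in the original space and in the embedding, then apply Kendall's τ to the two distance vectors. The sketch below uses the tau-a variant and Euclidean distance. Both choices, the function names, and the toy coordinates are assumptions rather than details taken from the paper.

```python
from itertools import combinations
from math import dist

def pairwise_distances(points):
    """Upper triangle of the Euclidean distance matrix, as a flat list."""
    return [dist(p, q) for p, q in combinations(points, 2)]

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical cells in the original space and in a 1-D embedding.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
embedded = [(0.0,), (2.0,), (6.0,), (14.0,)]
tau = kendall_tau(pairwise_distances(original), pairwise_distances(embedded))
print(tau)
```

A τ close to 1 means the embedding keeps the ordering of cell-to-cell distances. This quadratic loop over all distance pairs is only practical for small samples, and a library routine would be preferable at the scale of the datasets here.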
Execution time in minutes for eight competing methods.
| Dataset | # Cells | # Classes | sc-CGconv (min) | Gini Clust (min) | GLM-PCA (min) | Fano (min) | Seurat (min) | scGeneFit (min) | SC3 (min) | M3drop (min) |
|---|---|---|---|---|---|---|---|---|---|---|
| Data1 | 500 | 2 | 9 | 2 | 1 | 1 | 3 | 3 | 5 | 4 |
| Data2 | 1000 | 3 | 13 | 2 | 1 | 1 | 7 | 5 | 8 | 6 |
| Data3 | 1500 | 4 | 17 | 3 | 1 | 3 | 11 | 10 | 13 | 12 |
| Data4 | 2000 | 5 | 20 | 5 | 3 | 5 | 14 | 13 | 17 | 15 |
A brief summary of the datasets used here.
| Dataset | Dataset Description | #Features | #Instances | #Classes |
|---|---|---|---|---|
| Baron | Human pancreas cells | 20125 | 8569 | 8 |
| Klein | Mouse embryo cells | 24175 | 2717 | 4 |
| Melanoma | Human tumor cells | 19783 | 68579 | 14 |
| PBMC68k | Human blood tissue | 32738 | 68793 | 11 |