Snehalika Lall, Sumanta Ray, Sanghamitra Bandyopadhyay.
Abstract
A fundamental problem in downstream analysis of scRNA-seq data is that too few cell samples are available compared to the feature size. This is mostly due to the budgetary constraints of single-cell experiments or simply to the small number of available patient samples. Here, we present an improved version of the generative adversarial network (GAN), called LSH-GAN, that addresses this issue by producing new, realistic cell samples. We update the training procedure of the GAN generator using locality sensitive hashing, which speeds up sample generation and thus keeps the standard procedures of downstream analysis feasible. LSH-GAN outperforms the benchmarks in generating realistic, high-quality cell samples. Experimental results show that samples generated by LSH-GAN improve the performance of downstream analyses such as feature (gene) selection and cell clustering. Overall, LSH-GAN addresses the key challenges of small-sample scRNA-seq data analysis.
Year: 2022 PMID: 35688990 PMCID: PMC9187761 DOI: 10.1038/s42003-022-03473-y
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Workflow of LSH-GAN.
a Gene selection task in HDSS scRNA-seq data using samples generated with the LSH-GAN model. b Detailed architecture of LSH-GAN.
LSH-GAN algorithm.
| Step | Operation |
|---|---|
| | Data matrix ( |
| | Generated data ( |
| 1: | |
| 2: | |
| 3: | augment |
| 4: | real data |
| 5: | |
| 6: | |
| 7: | {The adaptive momentum gradient descent rule is used in our experiment.} |
| 8: | |
| 9: | Execute Locality Sensitive Hashing (LSH) on |
| 10: | |
| 11: | visit each data point sequentially in the order in which it appears in the data. |
| 12: | if the data point has not been visited earlier, select the data point and discard all its |
| 13: | |
| 14: | |
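Steps 9 to 12 describe a pass that hashes the data with LSH, visits points in order, and keeps a point only if it has not already been marked as visited. The following is a minimal sketch of that idea, not the authors' implementation. It assumes random-hyperplane (sign) hashing as the LSH family and treats points that share a hash bucket as the neighbors to discard; the names `lsh_buckets` and `lsh_subsample`, the number of hyperplanes, and the seed are all illustrative choices.

```python
import random
from collections import defaultdict


def lsh_buckets(points, n_planes=4, seed=0):
    """Hash each point to a sign signature using random hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # One random normal vector per hyperplane.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        # The signature records on which side of each hyperplane the point lies.
        signature = tuple(
            sum(w * x for w, x in zip(plane, point)) >= 0.0 for plane in planes
        )
        buckets[signature].append(idx)
    return buckets


def lsh_subsample(points, n_planes=4, seed=0):
    """Visit points in order; keep an unvisited point and mark its bucket-mates as visited."""
    buckets = lsh_buckets(points, n_planes, seed)
    bucket_of = {idx: sig for sig, members in buckets.items() for idx in members}
    visited = set()
    selected = []
    for idx in range(len(points)):
        if idx in visited:
            continue  # already covered by an earlier representative
        selected.append(idx)
        # Assumption: "neighbors" are the points sharing this point's hash bucket.
        visited.update(buckets[bucket_of[idx]])
    return selected


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    kept = lsh_subsample(data)
    print(len(kept), "of", len(data), "points kept")
```

With this bucketing, exactly one representative survives per non-empty bucket, so more hyperplanes give a finer partition and a larger retained subset.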
Wasserstein distance between generated and real data distributions.
| Nearest neighbor | Model | Epoch 10,000 | Epoch 15,000 | Epoch 20,000 | Epoch 25,000 |
|---|---|---|---|---|---|
| | LSH-GAN | | | | |
| | LSH-GAN | 1.09 | 0.89 | 0.83 | 0.82 |
| | LSH-GAN | 1.36 | 0.89 | 1.45 | 0.87 |
| | LSH-GAN | 1.53 | 1.35 | 1.19 | 0.83 |
| | GAN | 1.71 | 1.73 | 1.75 | 1.70 |
The model is trained on synthetic Gaussian mixture data of size 100 × 1000 with two non-overlapping classes.
The minimum distances are shown in bold face.
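The table compares distributions using the Wasserstein distance, where a smaller value means the generated samples lie closer to the real ones. As a hedged illustration, the sketch below computes the first Wasserstein distance between two one-dimensional empirical samples by integrating the gap between their empirical CDFs. Averaging this per-feature distance over all features is an assumption made here for multivariate data; the source does not state how the distance was aggregated. The function names are illustrative.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First Wasserstein distance between two 1-D empirical distributions."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    grid = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    # Integrate |F_u(x) - F_v(x)| over each interval between consecutive sample values.
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def mean_featurewise_wasserstein(real, generated):
    """Average the 1-D distance over features (an assumed aggregation, not from the paper)."""
    n_features = len(real[0])
    distances = [
        wasserstein_1d([row[j] for row in real], [row[j] for row in generated])
        for j in range(n_features)
    ]
    return sum(distances) / n_features


if __name__ == "__main__":
    real = [[0.0, 0.0], [1.0, 2.0]]
    generated = [[1.0, 0.0], [2.0, 2.0]]
    print(mean_featurewise_wasserstein(real, generated))
```

Identical samples give a distance of zero, and shifting every value of one sample by a constant changes the distance by that constant, which makes the function easy to sanity-check.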
Fig. 2 A toy example demonstrating generation of two-dimensional data from a known distribution.
Results show the distribution of generated data and real data for traditional GAN (upper row, a) and LSH-GAN (lower row, b).
Fig. 3 Comparison of LSH-GAN with state-of-the-art methods on the melanoma data.
a–c UMAP visualization of real and generated cell samples of melanoma data. d UMAP visualization of real scRNA-seq data with the original labels. e Expression values (shown in color bar) of two marker genes, CD8A (marker of CD8 T cells) and MS4A1 (marker of B cells), in real and generated data. f Bar plot of the Wasserstein distance between the generated and real cell samples.
AUC scores of a random forest classifier trained to discriminate real samples from samples generated by the competing methods.
| Method | Yan | Darmanis | Pollen | Klein | Melanoma |
|---|---|---|---|---|---|
| cscGAN | 0.65 ± 0.02 | 0.68 ± 0.01 | 0.64 ± 0.02 | 0.62 ± 0.02 | 0.66 ± 0.01 |
| Splatter | 0.69 ± 0.01 | 0.69 ± 0.02 | 0.67 ± 0.03 | 0.65 ± 0.01 | 0.72 ± 0.02 |
| SUGAR | 0.67 ± 0.02 | 0.66 ± 0.03 | 0.61 ± 0.02 | 0.64 ± 0.02 | 0.68 ± 0.01 |
| GAN | 0.72 ± 0.02 | 0.71 ± 0.02 | 0.73 ± 0.02 | 0.72 ± 0.02 | 0.76 ± 0.03 |
| f-GAN | 0.69 ± 0.01 | 0.70 ± 0.02 | 0.63 ± 0.02 | 0.62 ± 0.02 | 0.61 ± 0.03 |
| w-GAN | 0.61 ± 0.01 | 0.67 ± 0.03 | 0.63 ± 0.02 | 0.61 ± 0.02 | 0.64 ± 0.02 |
| LSH-GAN | | | | | |
The average AUC score (with 5-fold cross-validation) is reported for each dataset.
The lowest AUC scores are highlighted with bold face.
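In this evaluation a lower AUC is better: an AUC near 0.5 means the classifier can barely tell generated samples from real ones. The sketch below shows one way to compute AUC from classifier scores, using its interpretation as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. It is only an illustration: the random forest and the 5-fold cross-validation are omitted, and the function name, labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels: 1 = real sample, 0 = generated sample.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small check; a rank-based formulation would be preferable for large datasets.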
Adjusted Rand index (ARI) and normalized mutual information (NMI) scores of clustering results on real-life scRNA-seq data, using features selected from combined data (LSH-GAN, SUGAR, cscGAN, Splatter, and GAN columns) or from the original data (last two columns).
| Dataset | FS method | LSH-GAN ARI | LSH-GAN NMI | SUGAR ARI | SUGAR NMI | cscGAN ARI | cscGAN NMI | Splatter ARI | Splatter NMI | GAN ARI | GAN NMI | Original data ARI | Original data NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Darmanis | GLM-PCA | 0.634 | 0.66 | 0.413 | 0.43 | 0.531 | 0.54 | 0.42 | 0.43 | 0.129 | 0.15 | 0.4 | 0.41 |
| Darmanis | Fano Factor | 0.535 | 0.54 | 0.319 | 0.328 | 0.457 | 0.467 | 0.38 | 0.4 | 0.27 | 0.28 | 0.34 | 0.36 |
| Darmanis | CV2 Index | 0.598 | 0.61 | 0.42 | 0.453 | 0.51 | 0.53 | 0.481 | 0.51 | 0.461 | 0.48 | 0.457 | 0.462 |
| Darmanis | M3Drop | 0.648 | 0.665 | 0.513 | 0.537 | 0.58 | 0.59 | 0.507 | 0.52 | 0.48 | 0.512 | 0.46 | 0.48 |
| Darmanis | HVG (Seurat V4) | 0.68 | 0.702 | 0.51 | 0.54 | 0.556 | 0.573 | 0.539 | 0.54 | 0.46 | 0.472 | 0.43 | 0.427 |
| Yan | GLM-PCA | 0.895 | 0.9 | 0.709 | 0.713 | 0.798 | 0.8 | 0.715 | 0.72 | 0.62 | 0.63 | 0.66 | 0.678 |
| Yan | Fano Factor | 0.821 | 0.843 | 0.79 | 0.8 | 0.801 | 0.81 | 0.768 | 0.77 | 0.73 | 0.75 | 0.713 | 0.72 |
| Yan | CV2 Index | 0.891 | 0.913 | 0.801 | 0.812 | 0.825 | 0.84 | 0.793 | 0.81 | 0.719 | 0.743 | 0.7 | 0.723 |
| Yan | M3Drop | 0.898 | 0.904 | 0.802 | 0.82 | 0.796 | 0.81 | 0.79 | 0.823 | 0.761 | 0.783 | 0.71 | 0.732 |
| Yan | HVG (Seurat V4) | 0.91 | 0.917 | 0.811 | 0.82 | 0.891 | 0.9 | 0.802 | 0.81 | 0.81 | 0.83 | 0.8 | 0.81 |
| Pollen | GLM-PCA | 0.835 | 0.82 | 0.78 | 0.77 | 0.819 | 0.8 | 0.793 | 0.8 | 0.788 | 0.77 | 0.78 | 0.76 |
| Pollen | Fano Factor | 0.933 | 0.913 | 0.878 | 0.86 | 0.916 | 0.88 | 0.88 | 0.87 | 0.815 | 0.8 | 0.712 | 0.7 |
| Pollen | CV2 Index | 0.94 | 0.921 | 0.906 | 0.88 | 0.908 | 0.89 | 0.89 | 0.86 | 0.831 | 0.81 | 0.81 | 0.8 |
| Pollen | M3Drop | 0.918 | 0.9 | 0.864 | 0.854 | 0.897 | 0.87 | 0.79 | 0.77 | 0.758 | 0.74 | 0.735 | 0.723 |
| Pollen | HVG (Seurat V4) | 0.958 | 0.93 | 0.916 | 0.9 | 0.897 | 0.876 | 0.868 | 0.85 | 0.801 | 0.79 | 0.82 | 0.81 |
| Klein | GLM-PCA | 0.815 | 0.79 | 0.769 | 0.75 | 0.784 | 0.76 | 0.731 | 0.71 | 0.581 | 0.57 | 0.66 | 0.64 |
| Klein | Fano Factor | 0.8 | 0.78 | 0.742 | 0.72 | 0.782 | 0.761 | 0.77 | 0.76 | 0.699 | 0.66 | 0.796 | 0.77 |
| Klein | CV2 Index | 0.82 | 0.79 | 0.71 | 0.7 | 0.761 | 0.75 | 0.709 | 0.69 | 0.69 | 0.67 | 0.68 | 0.65 |
| Klein | M3Drop | 0.837 | 0.824 | 0.794 | 0.77 | 0.769 | 0.74 | 0.718 | 0.7 | 0.61 | 0.6 | 0.607 | 0.59 |
| Klein | HVG (Seurat V4) | 0.898 | 0.86 | 0.861 | 0.84 | 0.857 | 0.83 | 0.785 | 0.77 | 0.73 | 0.71 | 0.739 | 0.71 |
Data generated by the five competing methods are used for gene selection. Five gene selection methods are used to find the most variable genes, which are then used to cluster the original scRNA-seq data. The last two columns (ARI and NMI) report the clustering results obtained with features selected from the original scRNA-seq data alone.
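The ARI values in the table measure how well predicted clusters agree with the known cell labels, corrected for chance agreement. As an illustrative sketch (the clustering pipeline itself is not reproduced here, and the function name and example labels are hypothetical), ARI can be computed from pair counts in the contingency table of the two labelings:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts of two labelings."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: the labelings trivially agree
    # Pairs of samples placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    original = [0, 0, 1, 2]
    predicted = [0, 0, 1, 1]
    print(round(adjusted_rand_index(original, predicted), 3))
```

ARI equals 1 for identical partitions regardless of how the cluster labels are named, is close to 0 for random agreement, and can be negative when agreement is worse than chance.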
Fig. 4 Clustering results of Pollen and Yan datasets.
a Two-dimensional UMAP visualization of clustering results (original and predicted labels). b Consensus clustering plots of obtained clusters.
A brief summary of the datasets used in the experiments.
| Serial no. | Dataset name | Features | Instances | Classes |
|---|---|---|---|---|
| 1 | Yan | 20,214 | 90 | 7 |
| 2 | Klein | 24,175 | 2717 | 4 |
| 3 | Darmanis | 22,088 | 466 | 9 |
| 4 | Pollen | 23,794 | 299 | 11 |
| 5 | Melanoma | 19,783 | 68,579 | 14 |