Maryam Zand, Jianhua Ruan.
Abstract
Single-cell RNA sequencing is a powerful technology for obtaining transcriptomes at single-cell resolution. However, it suffers from dropout events (i.e., excess zero counts) because only a small fraction of the transcripts in each cell are captured during sequencing. This inherent sparsity of expression profiles hinders further characterization at the cell and gene levels, such as cell type identification and downstream analysis. To alleviate this dropout issue, we introduce a network-based method, netImpute, which leverages the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the expression level of a gene in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks, such as a cell co-expression network or a protein-protein interaction (PPI) network, evaluation results show that the gene co-expression network is consistently more beneficial, presumably because PPI networks usually lack cell type context, while cell co-expression networks can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute recovers missing transcripts in scRNA-seq data and enhances the identification and visualization of heterogeneous cell types more effectively than existing methods.
Keywords: clustering; co-expression network; data imputation; graph random walk; scRNA-seq data
Year: 2020 PMID: 32244427 PMCID: PMC7230610 DOI: 10.3390/genes11040377
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
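The Random Walk with Restart smoothing described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the published netImpute implementation: the network is an unweighted k-nearest-neighbor graph built from Pearson correlation between genes, the walk uses a row-normalized transition matrix, and the names `knn_gene_graph`, `rwr_smooth`, `k`, and `restart` are hypothetical.

```python
import numpy as np


def knn_gene_graph(X, k=5):
    """Build a symmetric 0/1 gene-gene adjacency from a genes x cells matrix.

    Assumption: each gene is linked to its k most correlated genes
    (Pearson correlation); the published method may build its network differently.
    """
    corr = np.nan_to_num(np.corrcoef(X))      # constant genes give NaN -> 0
    np.fill_diagonal(corr, -np.inf)           # never pick a gene as its own neighbor
    n = X.shape[0]
    nbrs = np.argsort(-corr, axis=1)[:, :k]   # indices of the k most correlated genes
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                 # symmetrize


def rwr_smooth(X, A, restart=0.5):
    """Closed-form fixed point of F = restart * X + (1 - restart) * P @ F.

    P is the row-normalized adjacency, so each gene's smoothed profile is a
    weighted average of its own profile and those of its network neighbors.
    """
    P = A / A.sum(axis=1, keepdims=True)
    n = A.shape[0]
    return restart * np.linalg.solve(np.eye(n) - (1.0 - restart) * P, X)


# Toy data: genes 0-1 form one module, genes 2-3 another.
# Gene 1 has a simulated dropout (zero) in cell 2, where gene 0 is expressed.
X_demo = np.array([
    [4.0, 0.0, 4.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 3.0],
    [0.0, 3.0, 0.0, 3.0],
])
A_demo = knn_gene_graph(X_demo, k=1)
F_demo = rwr_smooth(X_demo, A_demo, restart=0.5)
print(np.round(F_demo, 3))
```

In this toy example the zero for gene 1 in cell 2 is replaced by a positive value borrowed from its co-expressed neighbor gene 0, while entries that are zero across a whole module stay zero. Setting `restart=1.0` returns the input unchanged, so in this sketch the restart parameter controls how much information is borrowed from the network.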
The scRNA-seq datasets used for benchmarking.
| Dataset | Num of Cells | Num of Clusters | Accession No |
|---|---|---|---|
| Human brain | 420 | 8 | GSE67835 |
| Deng | 286 | 10 | GSE45719 |
| Pollen | 300 | 11 | SRP041736 |
| Usoskin | 622 | 4 | GSE102827 |
| Zeisel | 3005 | 7 | GSE60361 |
| Baron human | 8569 | 14 | GSE84133 |
| Treutlein | 80 | 5 | GSE52583 |
Figure 1. (Color online) netImpute improves data visualization and clustering performance on simulated data. (a) The first two principal components (PCs) calculated from the raw datasets with different dropout rates; (b) the first two PCs calculated from the datasets imputed by netImpute; (c) clustering accuracy for both raw and imputed datasets using the PCA+kmeans clustering method.
Figure 2. Using information from gene similarity, netImpute significantly enhances the clustering accuracy on real datasets. Violin plots show the adjusted Rand index (ARI) obtained from netImpute-treated data based on cell similarity and gene similarity for different parameter settings (including k), and based on the protein–protein interaction (PPI) network for different values of its parameter. Values above the violin plots are the p-values from t-tests comparing the ARIs from imputed data with the ARI from raw data.
Figure 3. Clustering performance of data treated by netImpute compared with other existing methods. (a) Bar plot showing the ARI obtained by applying PCA+kmeans to raw and imputed datasets; (b) bar plot showing the ARI obtained by applying SC3 to raw and imputed datasets.
Treutlein confusion matrix.
| | PCA + Kmeans on Raw Data | | | | | SC3 on Raw Data | | | | | PCA + Kmeans on Imputed Data | | | | | SC3 on Imputed Data | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cell Type | c1 | c2 | c3 | c4 | c5 | c1 | c2 | c3 | c4 | c5 | c1 | c2 | c3 | c4 | c5 | c1 | c2 | c3 | c4 | c5 |
| AT1 | 18 | 0 | 20 | 3 | 0 | 8 | 0 | 33 | 0 | 0 | 33 | 0 | 0 | 8 | 0 | 41 | 0 | 0 | 0 | 0 |
| AT2 | 1 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 12 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
| BP | 1 | 0 | 9 | 0 | 3 | 0 | 7 | 3 | 0 | 3 | 0 | 3 | 0 | 10 | 0 | 1 | 9 | 3 | 0 | 0 |
| Ciliated | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 |
| Clara | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 10 | 0 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 11 |
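To make the link between this confusion matrix and the ARI values used elsewhere in the evaluation concrete, the following sketch computes the adjusted Rand index directly from each 5 x 5 block of the table (rows are cell types, columns are clusters c1 to c5). The function name `adjusted_rand_index` and the dictionary labels are illustrative choices, and the numbers are transcribed from the table as printed here, so the results need not match the values plotted in the figures exactly.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table (list of rows of counts)."""
    n = sum(map(sum, table))
    pairs_cells = sum(comb(v, 2) for row in table for v in row)   # pairs agreeing in both labelings
    pairs_rows = sum(comb(sum(row), 2) for row in table)          # pairs sharing a true cell type
    pairs_cols = sum(comb(sum(col), 2) for col in zip(*table))    # pairs sharing a cluster
    expected = pairs_rows * pairs_cols / comb(n, 2)               # agreement expected by chance
    max_index = (pairs_rows + pairs_cols) / 2
    return (pairs_cells - expected) / (max_index - expected)


# Rows: AT1, AT2, BP, Ciliated, Clara; columns: clusters c1..c5.
tables = {
    "PCA+kmeans raw": [
        [18, 0, 20, 3, 0], [1, 0, 0, 0, 11], [1, 0, 9, 0, 3],
        [0, 3, 0, 0, 0], [0, 10, 1, 0, 0],
    ],
    "SC3 raw": [
        [8, 0, 33, 0, 0], [0, 0, 0, 0, 12], [0, 7, 3, 0, 3],
        [0, 0, 0, 3, 0], [0, 1, 0, 10, 0],
    ],
    "PCA+kmeans imputed": [
        [33, 0, 0, 8, 0], [0, 12, 0, 0, 0], [0, 3, 0, 10, 0],
        [0, 0, 0, 0, 3], [0, 0, 10, 1, 0],
    ],
    "SC3 imputed": [
        [41, 0, 0, 0, 0], [0, 0, 12, 0, 0], [1, 9, 3, 0, 0],
        [0, 0, 0, 3, 0], [0, 0, 0, 0, 11],
    ],
}

ari = {name: adjusted_rand_index(t) for name, t in tables.items()}
for name, value in ari.items():
    print(f"{name}: ARI = {value:.2f}")
```

For the table as transcribed, this gives roughly 0.35 (PCA+kmeans, raw), 0.63 (SC3, raw), 0.67 (PCA+kmeans, imputed), and 0.92 (SC3, imputed), so the ARI rises after imputation for both clustering methods.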
Figure 4. (Color online) Cell clustering and 2D PCA plot of the human brain dataset. (a) No imputation; (b) after application of SAVER; (c) after application of netImpute; (d) after application of netSmooth; (e) after imputing with MAGIC; (f) after application of scImpute.
Run time of netImpute and other imputation methods.
| Dataset | netImpute | MAGIC | SAVER-doFast | SAVER | netSmooth | scImpute |
|---|---|---|---|---|---|---|
| Baron (8569 cells) | 270 (s) | 90 (s) | 7396 (s) | 5 (days) | 2640 (s) | 29944 (s) |
| Zeisel (3005 cells) | 315 (s) | 14 (s) | 5104 (s) | 4 (days) | 1320 (s) | 26848 (s) |
Figure 5. Scatter plots of each gene's degree in the network vs. its mean imputed value across all cells for netImpute (a) and netSmooth (b), and of each gene's degree in the network vs. its log mean imputed value across all cells for netImpute (c) and netSmooth (d).