Peng Wu, Mo An, Hai-Ren Zou, Cai-Ying Zhong, Wei Wang, Chang-Peng Wu.
Abstract
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) is a powerful tool for studying organisms at single-cell resolution and for exploring heterogeneity between cells. Clustering is a fundamental step in scRNA-seq data analysis: it is key to understanding cell function and forms the basis of more advanced analyses. Nonnegative Matrix Factorization (NMF) has been widely used for clustering transcriptome data and has achieved good performance. However, existing NMF models are unsupervised and ignore known gene functions during clustering. A large body of knowledge about cell marker genes (genes expressed only in specific cell types) in human and model organisms has accumulated, for example in the Molecular Signatures Database (MSigDB), and it can serve as prior information when clustering scRNA-seq data. Because cells of the same type are likely to share similar biological functions and specific gene expression patterns, their marker genes can be used as prior knowledge in the clustering analysis.
Keywords: NMF model; Semi-supervised; Single cell RNA-seq
Year: 2020 PMID: 33088619 PMCID: PMC7571410 DOI: 10.7717/peerj.10091
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
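The unsupervised NMF clustering that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration using the standard Lee-Seung multiplicative updates for the Frobenius loss, not the rssNMF model of the paper; the function names `nmf` and `cluster_cells`, the iteration count, and the toy expression matrix are assumptions made for the example.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-10):
    """Approximate a nonnegative matrix X (genes x cells) as W @ H.

    Uses the standard multiplicative updates for the Frobenius loss.
    W (genes x k) is the basis matrix, H (k x cells) the coefficient matrix.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_cells(H):
    """Assign each cell to the component with the largest coefficient."""
    return np.argmax(H, axis=0)

# Toy data (made up for illustration): 6 genes x 8 cells, two cell groups,
# each with its own block of highly expressed genes plus small background.
rng = np.random.default_rng(1)
X = rng.random((6, 8)) * 0.05
X[:3, :4] += 5.0   # genes 0-2 are high in cells 0-3
X[3:, 4:] += 5.0   # genes 3-5 are high in cells 4-7

W, H = nmf(X, k=2)
labels = cluster_cells(H)
print(labels)
```

Cells 0-3 and cells 4-7 should receive two different labels; which group gets label 0 depends on the random initialization.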
Figure 1. (A-B) An illustration of how to construct the weight matrix Q (C).
(A) The heatmap is an ideal simulated gene expression matrix X; (B) the heatmap is part of the matrix X that only selects rows of marker genes.
Ten published scRNA-seq datasets used to test the rssNMF model.
All the datasets are scRNA-seq data of human or mouse embryos.
| Dataset | Units | GSE/ArrayExpress Number | Number of cells | Species | Number of Clusters |
|---|---|---|---|---|---|
| Biase | FPKM | | 56 | Mouse | 5 |
| Goolam | CPM | E-MTAB-3321 | 124 | Mouse | 5 |
| Yan | RPKM | | 124 | Human | 9 |
| Shin | RPKM | | 256 | Mouse | 10 |
| Deng | RPKM | | 259 | Mouse | 10 |
| Leng | Normalized counts | | 460 | Human | 4 |
| Kowalczyk | TPM | | 564 | Mouse | 8 |
| Camp | FPKM | | 734 | Human | 9 |
| Chu_1 | TPM | | 758 | Human | 6 |
| Chu_2 | TPM | | 1,018 | Human | 7 |
| Tasic | RPKM | | 71,585 | Mouse | 7 |
| Zeisel | Counts | | 60,361 | Mouse | 8 |
Notes.
FPKM: fragments per kilobase of transcript per million mapped reads
RPKM: reads per kilobase of transcript per million mapped reads
CPM: counts per million mapped reads
Benchmarking of rssNMF against other clustering methods.
All the algorithms were applied 50 times to each dataset. Parameter α for rNMF and rssNMF: 2. Parameter β for rssNMF: 2. Prior information: for each dataset, we randomly select one cluster and use 20 marker genes of the selected cluster to construct the weight matrix.
| Dataset | KMeans | HC | NMF | SC3 | rNMF | ssNMF | rssNMF |
|---|---|---|---|---|---|---|---|
| Biase | 0.712 | 0.761 | 0.774 | 0.844 | 0.806 | 0.796 | 0.862 |
| Goolam | 0.304 | 0.310 | 0.387 | 0.731 | 0.43 | 0.642 | 0.657 |
| Yan | 0.375 | 0.570 | 0.533 | 0.805 | 0.572 | 0.675 | 0.710 |
| Shin | 0.167 | 0.217 | 0.282 | 0.366 | 0.282 | 0.327 | 0.370 |
| Deng | 0.42 | 0.399 | 0.466 | 0.775 | 0.52 | 0.547 | 0.682 |
| Leng | 0.057 | 0.009 | 0.112 | 0.179 | 0.14 | 0.165 | 0.213 |
| Kowalczyk | 0.182 | 0.176 | 0.269 | 0.307 | 0.304 | 0.293 | 0.365 |
| Camp | 0.232 | 0.225 | 0.274 | 0.327 | 0.3 | 0.297 | 0.305 |
| Chu_1 | 0.177 | 0.199 | 0.22 | 0.205 | 0.241 | 0.326 | 0.369 |
| Chu_2 | 0.204 | 0.242 | 0.314 | 0.312 | 0.314 | 0.322 | 0.357 |
| Tasic | 0.51 | 0.284 | 0.705 | 0.822 | 0.711 | 0.791 | 0.790 |
| Zeisel | −4.97E−05 | −9.36E−04 | 2.43E−03 | −5.60E−04 | 2.65E−03 | 0.007 | 0.014 |
Figure 2. Performance of rssNMF versus parameter.
rssNMF is stable with respect to the parameter and achieves good performance as the parameter varies from 2 to 32.
Figure 3. Factor matrices W (basis matrix: A, D, G) and H (coefficient matrix: B, E, H) and the consensus matrix (C, F, I) obtained from NMF (A, B, C), rNMF (D, E, F) and rssNMF (G, H, I), respectively, for the Yan dataset with 124 cells and nine clusters.
The annotation color bar denotes the nine clusters. The row annotations of W and the column annotations of H indicate the assignment of genes and samples to clusters. The parameter α = 2 for rNMF and rssNMF, and β = 2 for rssNMF.
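A consensus matrix such as the ones in panels C, F and I is commonly built by averaging cell-by-cell co-clustering indicators over repeated runs. The sketch below assumes that common definition; this record does not state how the paper computes its consensus matrix, and the function names and the three toy label vectors are hypothetical.

```python
import numpy as np

def connectivity(labels):
    """Return a cells x cells 0/1 matrix: 1 where two cells share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus_matrix(label_runs):
    """Average the connectivity matrices of several clustering runs."""
    return np.mean([connectivity(run) for run in label_runs], axis=0)

# Three hypothetical runs on four cells. Label names may differ between
# runs; only whether two cells are grouped together matters.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1],  # cell 2 joins the first group in this run
]
C = consensus_matrix(runs)
print(C)
```

Entries near 1 mark pairs of cells that are grouped together in almost every run, and entries near 0 mark pairs that are almost never grouped together, so a stable clustering yields a matrix with clear blocks.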