Dongyuan Song, Kexin Li, Zachary Hemminger, Roy Wollman, Jingyi Jessica Li.
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.
Year: 2021 PMID: 34252925 PMCID: PMC8275345 DOI: 10.1093/bioinformatics/btab273
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An overview of scPNMF. Taking a log-transformed gene-by-cell count matrix as the input, scPNMF first learns a low-dimensional sparse weight matrix W and a low-dimensional cell embedding matrix S. Second, it removes the bases irrelevant to cell-type variations by examining bases’ functional annotations (optional), Pearson correlations with cell library sizes, and multimodality. Given a user-defined gene number M, scPNMF performs M-truncation to facilitate two main applications: (1) selecting the desired number of informative genes; (2) projecting new targeted gene profiling data onto the low-dimensional space defined by reference scRNA-seq data. The details are in the Methods section.
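The steps after the factorization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_bases`, `m_truncate` and `project`, the correlation cutoff of 0.7, and the rule of ranking genes by their largest weight across the retained bases are all hypothetical choices made for the example. The optional functional-annotation check and the multimodality test are omitted, and W is assumed to have been learned already.

```python
import numpy as np


def select_bases(W, X, max_libsize_corr=0.7):
    """Keep bases whose cell scores are not dominated by library size.

    W: genes x K non-negative weight matrix (assumed already learned).
    X: genes x cells log-transformed count matrix.
    The 0.7 cutoff is an illustrative assumption.
    """
    S = W.T @ X                # K x cells score (embedding) matrix
    lib_size = X.sum(axis=0)   # total log-count per cell
    keep = []
    for k in range(S.shape[0]):
        r = np.corrcoef(S[k], lib_size)[0, 1]  # Pearson correlation
        if abs(r) <= max_libsize_corr:
            keep.append(k)
    return keep


def m_truncate(W, keep, M):
    """Zero out all but M genes in the retained bases.

    Assumption: genes are ranked by their largest weight across
    the retained bases.
    """
    W_sel = W[:, keep]
    gene_score = W_sel.max(axis=1)
    top = np.argsort(gene_score)[::-1][:M]  # indices of the M top genes
    W_trunc = np.zeros_like(W_sel)
    W_trunc[top] = W_sel[top]
    return np.sort(top), W_trunc


def project(W_trunc, X_new):
    """Project new genes x cells data onto the retained bases."""
    return W_trunc.T @ X_new
```

In this sketch, `m_truncate` returns both the indices of the M selected genes (application 1) and the truncated weight matrix, which `project` then applies to new data measured on those genes (application 2).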
Comparison of the properties of PNMF, PCA and NMF
| Method | Optimization problem | Non-negativity | Sparsity | Mutual exclusiveness | New data projection |
|---|---|---|---|---|---|
| PNMF | | Yes | Very high | Very high | Yes |
| PCA | | No | Low | Low | Yes |
| NMF | | Yes | High | High | No |
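For reference, the three methods are usually stated as the optimization problems below. These are the standard textbook forms, given here as an assumption about what the comparison refers to; the paper's exact notation and norm may differ. X is the (centered, in the case of PCA) gene-by-cell matrix with p genes and n cells, and K is the number of bases.

```latex
\begin{aligned}
\text{PNMF:}\quad & \min_{W \in \mathbb{R}^{p \times K},\; W \ge 0} \; \lVert X - W W^{\top} X \rVert_F^2 \\
\text{PCA:}\quad  & \min_{W \in \mathbb{R}^{p \times K}} \; \lVert X - W W^{\top} X \rVert_F^2
                    \quad \text{s.t. } W^{\top} W = I_K \\
\text{NMF:}\quad  & \min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2,
                    \quad H \in \mathbb{R}^{K \times n}
\end{aligned}
```

In the PNMF and PCA forms the cell embedding is $S = W^{\top} X$, so a new cell $x_{\text{new}}$ can be projected directly as $W^{\top} x_{\text{new}}$. In the NMF form the embedding $H$ is a separate free variable, so placing a new cell requires solving another optimization problem; this is consistent with the "New data projection" column.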
Fig. 2. Illustration of the sparse and interpretable projection found by scPNMF. We use the FregGold dataset as an example. (a) Comparison of the weight matrices of PCA and PNMF. Heatmaps visualize the learned weight matrices of PCA (top) and PNMF (bottom), where rows are genes and columns are bases. Red represents positive weights while blue represents negative weights. The rows are ordered by gene-wise hierarchical clustering. Compared to PCA, the weight matrix of PNMF is strictly non-negative, much sparser and mutually exclusive between bases. (b) GO analysis result of each basis in the weight matrix of PNMF. Text in black boxes summarizes the functions of genes in each basis. The enriched GO terms are almost mutually exclusive, implying that each basis represents a unique gene functional cluster. (c) Statistical tests on each basis in the score matrix of PNMF. Top row: scatter plots of scores and total log-counts (cell library sizes). Each dot represents a cell. Cell scores in bases 1 and 4 are highly correlated with cell library sizes. Bottom row: histograms of cell scores in each basis. Scores in bases 2 and 3 show strong multimodality patterns (adjusted P-value ). (d) UMAP visualizations of cells based on high-weight genes in the unselected bases 1 and 4 and those in the selected bases 2, 3 and 5. Genes in the unselected bases completely fail to distinguish the three cell types, while genes in the selected bases lead to a clear separation of the three cell types.
Fig. 3. Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq datasets. (a) Clustering accuracies (ARI values) of three clustering methods based on the informative genes selected. Gene selection methods are ordered from left to right by their average ARI across the three clustering methods and the seven datasets. (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by each method. Genes selected by scPNMF lead to a clear separation between naive cytotoxic T cells and regulatory T cells, while the genes selected by other methods do not.
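The ARI (adjusted Rand index) used as the clustering accuracy in panel (a) compares a clustering with the known cell-type labels and corrects for chance agreement. A minimal sketch of the standard pair-counting formula follows; the function name `adjusted_rand_index` is chosen for this example, and the code is not taken from the paper's benchmarking pipeline.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts and their marginals.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs placed together under each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two cells: trivially identical

    expected = sum_true * sum_pred / total   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)  # upper bound on agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)
```

An ARI of 1 means the clustering reproduces the reference labels up to a renaming of clusters, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.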
Prediction accuracy of cell types based on 100 informative genes selected by 12 gene selection methods in the two case studies with paired reference scRNA-seq data and targeted gene profiling data
| Method | Zheng8 RandomForest | Zheng8 KNN | Zheng8 SVM | PBMC RandomForest | PBMC KNN | PBMC SVM | Average accuracy |
|---|---|---|---|---|---|---|---|
| scPNMF | 0.85 (0.83,0.87) | | | | | | |
| M3Drop | 0.85 (0.83,0.87) | | | | 0.77 (0.71,0.82) | 0.63 (0.57,0.69) | 0.79 |
| SeuratDISP | 0.84 (0.81,0.86) | 0.78 (0.75,0.81) | 0.86 (0.84,0.88) | 0.80 (0.75,0.84) | 0.75 (0.70,0.80) | 0.64 (0.58,0.70) | 0.78 |
| corFS | 0.80 (0.77,0.82) | 0.75 (0.73,0.78) | 0.82 (0.80,0.85) | 0.82 (0.77,0.86) | 0.81 (0.76,0.86) | 0.62 (0.56,0.68) | 0.77 |
| GiniClust | | 0.79 (0.76,0.81) | 0.86 (0.83,0.88) | 0.80 (0.75,0.84) | 0.76 (0.71,0.81) | 0.53 (0.47,0.60) | 0.75 |
| scran | 0.79 (0.76,0.81) | 0.72 (0.69,0.75) | 0.82 (0.80,0.85) | 0.78 (0.72,0.82) | 0.73 (0.67,0.78) | 0.67 (0.61,0.72) | 0.75 |
| SeuratMVP | 0.83 (0.81,0.85) | 0.77 (0.74,0.80) | 0.85 (0.82,0.87) | 0.82 (0.77,0.86) | 0.74 (0.69,0.79) | 0.47 (0.40,0.53) | 0.74 |
| Scanpy | 0.79 (0.77,0.82) | 0.71 (0.68,0.74) | 0.80 (0.78,0.83) | 0.80 (0.75,0.84) | 0.76 (0.71,0.81) | 0.52 (0.46,0.58) | 0.73 |
| SCMarker | 0.77 (0.74,0.79) | 0.68 (0.65,0.71) | 0.74 (0.71,0.77) | 0.77 (0.71,0.81) | 0.71 (0.65,0.76) | 0.45 (0.39,0.52) | 0.69 |
| SeuratVST | 0.73 (0.70,0.76) | 0.68 (0.65,0.71) | 0.75 (0.73,0.78) | 0.74 (0.68,0.79) | 0.68 (0.63,0.74) | 0.40 (0.34,0.46) | 0.67 |
| DANB | 0.71 (0.68,0.73) | 0.69 (0.66,0.71) | 0.75 (0.73,0.78) | 0.73 (0.67,0.78) | 0.74 (0.68,0.79) | 0.28 (0.23,0.34) | 0.65 |
| irlbaPcaFS | 0.68 (0.65,0.71) | 0.61 (0.58,0.64) | 0.71 (0.68,0.74) | 0.71 (0.65,0.76) | 0.77 (0.71,0.82) | 0.16 (0.12,0.21) | 0.61 |
Values in parentheses are 95% confidence intervals. The highest number in each column is marked in boldface and underlined.
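The two derived quantities in the table can be illustrated with a short sketch. The average accuracy is the plain mean of a method's six accuracies. For the confidence interval, the source does not say which method was used, so the Wilson score interval below is only one common choice and is an assumption; the function names and the counts in the usage note are hypothetical.

```python
from math import sqrt


def mean_accuracy(accuracies):
    """Plain mean of a method's per-classifier, per-dataset accuracies."""
    return sum(accuracies) / len(accuracies)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%).

    Assumption: this is one common interval for classification accuracy;
    it is not stated to be the method used for the table.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For example, `mean_accuracy([0.84, 0.78, 0.86, 0.80, 0.75, 0.64])` rounds to the 0.78 reported for SeuratDISP. A call such as `wilson_interval(85, 100)` uses made-up counts; the interval narrows as the number of test cells grows.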