Literature DB >> 34252925

scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling.

Dongyuan Song¹, Kexin Li², Zachary Hemminger^3,4, Roy Wollman^3,4,5, Jingyi Jessica Li^2,6,7,8.

Abstract

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.
RESULTS: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.
AVAILABILITY AND IMPLEMENTATION: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34252925 PMCID： PMC8275345 DOI： 10.1093/bioinformatics/btab273

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The recent development of single-cell RNA sequencing (scRNA-seq) technologies provides unprecedented opportunities to decipher transcriptome heterogeneity among individual cells (Birnbaum, 2018; Potter, 2018; Zhu ). A typical scRNA-seq dataset contains thousands to tens of thousands of genes; however, a subset of genes, which we call informative genes, are usually sufficient for representing the underlying biological variations of cells in the dataset for two reasons. First, variations of many genes are not related to the biological variations of interest. For instance, fluctuations in the expression levels of housekeeping genes are irrelevant to cell types (Eisenberg and Levanon, 2013; Thellin ). Second, many genes have strongly correlated expression levels, suggesting that one gene may represent a group of genes without much loss of information (Subramanian ). Therefore, for scRNA-seq data analysis, informative gene selection has three advantages: (i) enhancing biological signals by removing unwanted technical variations, (ii) improving the interpretability of analysis results by focusing on informative genes and (iii) reducing the number of genes to save computational resources. Besides scRNA-seq data analysis, informative gene selection is also crucial for designing single-cell targeted gene profiling experiments, which we define to include all technologies that measure only a specific set of genes’ expression levels in individual cells. Unlike scRNA-seq, targeted gene profiling requires a limited number (often no more than hundreds) of genes to be specified before sequencing. Examples of targeted gene profiling include spatial technologies [e.g. smFISH (Raj ) and MERFISH (Moffitt )] and non-spatial technologies [e.g. BART-Seq (Uzbas ), HyPR-seq (Marshall ) and 10x-Genomics Targeted Gene Expression]. Compared with scRNA-seq, targeted gene profiling technologies have advantages such as capturing spatial information (by smFISH and MERFISH), having a lower cost per cell (by BART-Seq), and exhibiting a higher sensitivity for detecting lowly expressed genes (by HyPR-seq). However, it remains an open and challenging question to optimize the gene selection for targeted gene profiling under a gene number limitation. Given the importance of informative gene selection, researchers have developed many gene selection methods for scRNA-seq data. Most existing methods select genes based on the relationship between per-gene expression means and per-gene expression variances (with the mean and variance of each gene calculated across cells). Popular example methods include variance stabilization transformation (vst) (Hafemeister and Satija, 2019) and mean–variance plot (mvp) in the R package Seurat (Stuart ), as well as modelGeneVar in the R package scran (Lun ). These methods select highly variable genes that have large expression variances in relation to their expression means. Other methods use various metrics of gene importance instead of the per-gene expression variance. For example, M3Drop selects the genes that have zero expression levels in many cells (Andrews and Hemberg, 2019); GiniClust selects the genes with large Gini indices of expression levels (Jiang ); SCMarker selects the genes that have expression levels bi/multimodally distributed and are coexpressed or mutually exclusively expressed with some other genes (Wang ). A common limitation of these existing methods is that they are all designed to select a relatively large number of genes; thus, their performance for selecting a small number of genes remains unclear. For instance, in Seurat, the default gene number is 2000; SCMarker selects 700–900 genes in its exemplar applications (Wang ). All these gene numbers are much greater than 200, the maximum gene number allowed by multiple targeted gene profiling technologies. Therefore, existing gene selection methods may not be suitable for selecting genes for targeted gene profiling. Another drawback of these methods is that their selected genes lack functional interpretability; that is, their selected genes are not categorized as functional gene groups. In addition to these gene selection methods, linear dimensionality reduction methods, such as principal component analysis (PCA) and non-negative matrix factorization (NMF), can also be used for gene selection. Specifically, genes can be selected based on their contributions to the projected low dimensions found by PCA or NMF (Baron ; Macosko ; Ye and Li, 2016; Zhu ). Although many variants of PCA and NMF algorithms have been developed for scRNA-seq data analysis, they are not designed for gene selection (Boileau ; Duren ; Durif ; Gao and Welch, 2020; Welch ; Yang and Michailidis, 2016; Zhang ). Here, we propose an unsupervised method scPNMF to simultaneously select informative genes and project scRNA-seq data onto an interpretable low-dimensional space. Leveraging the Projective Non-negative Matrix Factorization (PNMF) algorithm (Yuan ), scPNMF combines the advantages of PCA and NMF by outputting a non-negative sparse weight matrix that can project cells in a high-dimensional scRNA-seq dataset onto a low-dimensional space. Unlike the weight matrix (a.k.a., loading matrix) found by PCA, the non-negative sparse weight matrix output by scPNMF corresponds to bases that each corresponds to a group of coexpressed genes. Compared with the original PNMF, a unique feature of scPNMF is basis selection: scPNMF uses correlation screening and multimodality testing to remove the bases that cannot reveal potential cell clusters in the input scRNA-seq dataset. There are two functionalities of scPNMF: (i) given a prespecified gene number and a scRNA-seq dataset, scPNMF selects informative genes based on its weight matrix; (ii) given a targeted gene profiling dataset containing the informative genes, scPNMF projects this dataset onto the same low-dimensional space of a reference scRNA-seq dataset containing cell-type labels, thus enabling cell-type annotation on the targeted gene profiling dataset. A comprehensive benchmark shows that scPNMF outperforms existing gene selection methods in two aspects. First, the informative genes selected by scPNMF lead to the most accurate cell clustering. Second, the informative genes and weight matrix of scPNMF lead to the best cell-type prediction accuracy for targeted gene profiling data. Therefore, scPNMF is a powerful gene selection method that can guide the experimental design and data analysis of single-cell targeted gene profiling.

2 Materials and methods

The core of scPNMF is to learn a low-dimensional embedding of cells so that the bases of the low-dimensional space correspond to sparse and mutually exclusive gene groups, and that genes in each group are coexpressed and thus functionally related. Figure 1 illustrates the workflow of scPNMF. The input of scPNMF is a log-transformed gene-by-cell count matrix measured by scRNA-seq. There are two main steps in scPNMF: (i) it learns a low-dimensional sparse weight matrix by PNMF; (ii) it selects bases in the weight matrix based on functional annotations (optional), correlation screening and multimodality testing to remove uninformative bases that cannot distinguish cell types. The output of scPNMF includes (i) the selected weight matrix, a sparse and mutually exclusive encoding of genes as new, low dimensions and (ii) the score matrix containing embeddings of input cells in the low dimensions. The selected weight matrix has two main applications: extracting informative genes for downstream analyses, such as cell clustering and new marker gene identification, and projecting new targeted gene profiling data for data integration and cell-type annotation.

Fig. 1

An overview of scPNMF. Taking a log-transformed gene-by-cell count matrix as the input, scPNMF first learns a low-dimensional sparse weight matrix W and a low-dimensional cell embedding matrix S. Second, it removes the bases irrelevant to cell-type variations by examining bases’ functional annotations (optional), Pearson correlations with cell library sizes, and multimodality. Given a user-defined gene number M, scPNMF performs M-truncation to facilitate two main applications: (1) selecting the desired number of informative genes; (2) projecting new targeted gene profiling data onto the low-dimensional space defined by reference scRNA-seq data. The details are in the Methods section.

2.1 scPNMF Step I: PNMF

In this section, we review the PNMF algorithm (Yuan ; Yang and Oja, 2010) as the foundation of scPNMF. We first compare the formulation of PNMF with that of PCA and NMF, and we show that PNMF has the advantages of both PCA and NMF so that it can be a useful tool for scRNA-seq data analysis. Next, we introduce our PNMF implementation. Given a log-transformed count matrix , whose p rows correspond to genes and whose n columns represent cells, and a positive integer , PNMF aims to find a K-dimensional space, whose dimensions correspond to non-negative, sparse and mutually exclusive linear combinations of the p genes, so that projecting the n cells onto the K-dimensional space does not cause much information loss (i.e. projecting the K-dimensional embeddings of the n cells back to the original p-dimensional space can largely restore the original n cells). PNMF tackles this task by solving the optimization problem: where denotes the Frobenius matrix norm. The solution W is referred to as a weight matrix. Each column of W is a basis, whose p entries are the weights of the p genes. PNMF requires all weights to be non-negative, leading to a sparse W with most weights as zeros. PCA is similar to PNMF but does not require all weights to be non-negative. We can write the optimization problem of PCA as whose solution W is also a weight matrix but not sparse, and W is often referred to as the loading matrix. A common property of PNMF and PCA is that the transpose of their weight matrix, , can be used to project a new cell with p gene measurements, , onto the K-dimensional space as . In contrast to PNMF and PCA, NMF finds two non-negative matrices W and H so that their product approximates the original matrix X. NMF solves the optimization problem: whose solution W still has K columns representing bases, and H has n columns as K-dimensional embeddings of the n cells. Due to the non-negative constraint on W and H, W is a sparse matrix (Lee and Seung, 1999). However, the transpose cannot be used as a projection matrix from the original p-dimensional space to a K-dimensional space. The reason is that, if is a projection matrix, then by the definition of H we have , which would convert the objective function (3) of NMF to the objective function (1) of PNMF. In other words, PNMF is a constrained version of NMF by requiring to be a projection matrix. Hence, PNMF inherits the property of NMF by having non-negative, sparse bases that are mostly mutually exclusive (i.e., different bases correspond to different gene groups). Moreover, based on the similarities of the objective functions of PNMF (1) and PCA (2), we can see that PNMF also resembles PCA by finding a weight matrix whose transpose can serve as a projection matrix and whose bases are largely orthogonal to each other. Table 1 summarizes the properties of PNMF, PCA and NMF.

Table 1

Comparison of the properties of PNMF, PCA and NMF

	Optimization problem	Non-negativity	Sparsity	Mutual exclusiveness	New data projection
PNMF	minW\|\|X−WWTX\|\| s.t. W≥0	Yes	Very high	Very high	Yes
PCA	minW\|\|X−WWTX\|\| s.t. WTW=I	No	Low	Low	Yes
NMF	minW,H\|\|X−WH\|\| s.t. W,H≥0	Yes	High	High	No

Comparison of the properties of PNMF, PCA and NMF In the context of scRNA-seq data analysis, the above advantages of PNMF lead to an interpretable and useful weight matrix W. First, the high sparsity of W makes each basis (column) depend on only a small set of genes, which has been defined as a meta-gene for NMF (Brunet ). Second, the mutual exclusiveness of W makes different bases correspond to different gene sets, easing the interpretation of bases as meta-genes or functional units. Third, the projection matrix allows the alignment of new data to reference data, thus facilitating cell-type annotation on the new data.Algorithm 1 summarizes the key steps of PNMF implementation in scPNMF. Our implementation mainly follows the two papers that proposed the PNMF algorithm (Yang and Oja, 2010; Yuan ), and we change the initialization of W to the weight matrix found by PCA, , with the absolute value taken on every entry. Our initialization is motivated by the desired orthogonality of bases (i.e. columns of W).

Algorithm 1 Pseudocode of PNMF implementation in scPNMF

Initialize: 1: while not converge do 2: for ; do 3: 4: end for 5: 6: end while Output: With the weight matrix learned by PNMF, we obtain the score matrix , whose K rows correspond to the bases and whose n columns represent the cells. Specifically, the jth column of S is the K-dimensional embedding of the jth cell; the kth row of S, denoted by , contains the scores (i.e. coordinates) of all n cells in the kth basis: where is the kth column of W, . The low rank K needs to be prespecified in PNMF, same as in PCA and NMF. A larger K preserves more information in X but also removes less noise (technical variation of cells that is not of biological interest), impedes the interpretation of W (more bases are more difficult to interpret) and increases the computational burden. To choose K in a data-driven way, we propose an orthogonality measure, which shows that K = 20 is a reasonable choice for multiple scRNA-seq datasets (Supplementary Section S1.1).

2.2 scPNMF Step II: basis selection

The second key step of scPNMF is to select informative bases among the K bases found by PNMF (i.e. columns of W and rows of S) to remove unwanted variations of cells (e.g. variations irrelevant to cell types). The columns of W enjoy high sparsity and mutual exclusiveness; that is, each column contains positive weights corresponding to a unique small set of genes, so it is expected to reflect a certain biological function. However, some biological functions may not be relevant to the cell heterogeneity of interest, for example, cell-type composition. Motivated by this, we propose three strategies for selecting informative bases (columns of W and rows of S): functional annotations (optional), correlations with cell library sizes and tests of multimodality.

2.2.1 Strategy 1: examine bases by functional annotations (optional)

The first, optional strategy is to annotate the biological function(s) of each basis in the weight matrix. For example, scPNMF may apply gene ontology (GO) analysis to the top 10% genes with the highest weights in each basis (column of W) and record the enriched GO terms as the basis’ functional annotation. Then, users with prior knowledge can interpret the functional annotation on each basis and decide whether or not to remove the basis. For example, if the goal is to delineate cell types in scRNA-seq data, a basis corresponding to cell-cycle genes should be removed because they would obscure the distinction of cell types. However, it is worth noting that filtering bases by biological annotations is optional in scPNMF. Conservative users can keep all K bases output by PNMF and directly use data-driven basis selection (Section 2.2.2). For our results in this article, scPNMF removes the bases corresponding to well-known housekeeping genes (Supplementary Section S2).

2.2.2 Data-driven strategies

2.2.2.1 Strategy 2: examine bases by correlations with cell library sizes

Note that the input of scPNMF is a log-transformed unnormalized count matrix for users’ convenience. Hence, scPNMF does not adjust for cell library sizes in the computation of W and S in Step I. (For a detailed discussion on why scPNMF uses unnormalized data as input, see Supplementary Section S6.) Given that the variance of cell library sizes contributes to unwanted variations of cells (Hafemeister and Satija, 2019), it is necessary to remove the bases whose corresponding rows in S are strongly correlated with cell library sizes. We use the total log-transformed counts to approximate the library size of each cell, and we calculate the Pearson correlation between each and the library sizes of n cells. The strategy is to retain the bases whose Pearson correlations are under a pre-defined threshold, which we set to 0.7 based on empirical observations (Supplementary Section S1.2).

2.2.2.2 Strategy 3: examine bases by multimodality tests

Another data-driven strategy is to retain the bases whose corresponding scores are multimodally distributed. If a basis’ score vector (row in S) contains n scores with a multimodality pattern, then it is likely to distinguish cell types and should be retained. To implement this strategy, we use the ACR test (Ameijeiras-Alonso ) to check the multimodality of each basis’ score vector. The null hypothesis is that the score vector contains n scores sampled from a unimodal distribution, and the alternative hypothesis is that the distribution has more than one mode. After performing multiple multimodality tests, one per basis, we use the Benjamini–Hochberg procedure to set a P-value threshold by controlling the false discovery rate under 1%. The bases whose P-values are under this threshold will be retained. In summary, scPNMF step II allows users to use Strategy 1 to filter out uninformative bases based on functional annotations if available; then it implements data-driven Strategies 2 and 3 to further remove bases that have strong correlations with cell library sizes and exhibit unimodality patterns. The retained bases will have their corresponding columns in W selected and stacked into the selected weight matrix , where K0 is the number of selected bases.

2.3 Applications of scPNMF output: informative gene selection and new data projection

The selected weight matrix output by scPNMF has two main applications: selection of a desired number of informative genes and projection of new targeted gene profiling data onto the low-dimensional space defined by . Given a gene number M (e.g. 200), scPNMF uses M-truncation, a step to select M rows in , resulting in M informative genes and a truncated, selected weight matrix for new data projection.

2.3.1 M-Truncation and informative gene selection

We denote the desired number of informative genes by , with . M-truncation has three steps. 3.1. Truncate the selected weight matrix by setting all to be 0; 3.2. Keep the M rows with nonzero entries; stack them by row into based on the order of the informative genes. In short, scPNMF selects informative genes based on their maximum weights in the selected bases. The rationale is that a gene’s maximum weight reflects the gene’s contribution to the establishment of the K0-dimensional space, which preserves the n cells’ biological variations of interest. Hence, genes with larger maximum weights are more informative in the sense of encoding cells’ biological variations. An important application of informative gene selection is to guide the design of targeted gene profiling experiments.

2.3.2 New data projection

Given the selected M informative genes, once new cells are measured by targeted gene profiling on these genes, can be used to project the new cells onto the K0-dimensional space where the cells in the input scRNA-seq data are embedded in. If the input data has cell-type annotations, we refer to the input data as reference data, then we can predict the new cells’ types from the types of the cells in the reference data. In detail, new data projection has the following steps: Apply scPNMF with M-truncation to input, reference data with n cells to obtain the truncated, selected weight matrix . Construct as a submatrix of X, with rows corresponding to the rows of , that is, the M informative genes. Hence, the K0-dimensional embeddings of the n cells in the reference data are the columns of Denote the targeted gene profiling data of new cells with M informative genes measured by . Note that contains log-transformed counts and has rows (genes) corresponding to the rows of . Project the cells to the K0-dimensional space by (Optional) Normalize and to remove batch effects, if existent, by using a single-cell data integration method such as Harmony (Korsunsky ). Now the n reference cells and the new cells are in the same K0-dimensional space with biological variations preserved. Then, a classifier can be trained on the n reference cells’ types and for cell-type prediction, and it can be used to predict the cells’ types from .

3 Results

3.1 scPNMF outputs a sparse and functionally interpretable representation of scRNA-seq data

We first demonstrate that scPNMF Step I, PNMF, outputs a sparse and functionally interpretable gene encoding of cells. We use the FregGold dataset (Freytag ), which consists of three cell types (three human lung adenocarcinoma cell lines), and set the basis number K = 5 for demonstration purpose. Both PCA and PNMF learn a weight matrix that can project the original scRNA-seq data onto a 5D space. Unlike the weight matrix of PCA that has no zero entries, the weight matrix of PNMF is non-negative, highly sparse, containing 42.6% of entries as zeros, and has bases that are largely mutually exclusive (i.e. nonzero entries in different columns correspond to different rows/genes) (Fig. 2a). Compared with NMF, PNMF also has greater sparsity and mutual exclusiveness in bases (Supplementary Section S7). GO enrichment analysis shows that high weight genes in each PNMF basis are enriched with conceptually similar GO terms, and high weight genes in different PNMF bases are enriched with conceptually different GO terms (Fig. 2b). This result indicates that PNMF bases correspond to gene groups with distinct functions. On the contrary, the PCA bases do not have good functional interpretations: the high weight genes in each PCA basis are not enriched with conceptually similar GO terms, and different PCA bases share many high weight genes (Supplementary Fig. S8).

Fig. 2

Illustration of the sparse and interpretable projection found by scPNMF. We use the FregGold dataset as an example. (a) Comparison of the weight matrices of PCA and PNMF. Heatmaps visualize the learned weight matrices of PCA (top) and PNMF (bottom), where rows are genes and columns are bases. Red represents positive weights while blue represents negative weights. The rows are ordered by gene-wise hierarchical clustering. Compared to PCA, the weight matrix of PNMF is strictly non-negative, much more sparse and mutually exclusive between bases. (b) GO analysis result of each basis in the weight matrix of PNMF. Texts in black boxes summarize the functions of genes in each basis. The enriched GO terms are almost mutually exclusive, implying that each basis represents a unique gene functional cluster. (c) Statistical tests on each basis in the score matrix of PNMF. Top row: scatter plots of scores and total log-counts (cell library sizes). Each dot represents a cell. Cell scores in bases 1 and 4 are highly correlated with cell library sizes. Bottom row: histograms of cell scores in each basis. Scores in bases 2 and 3 show strong multimodality patterns (adjusted P-value ). (d) UMAP visualizations of cells based on high weight genes in the unselected bases 1 and 4 and those in the selected bases 2, 3 and 5. Genes in the unselected bases completely fail to distinguish the three cell types, while genes in the selected bases lead to a clear separation of the three cell types. To further analyze the PNMF bases, we list the top 10 high weight genes in each basis (Supplementary Table S2), from which we identify many well-known genes with important functions. For instance, basis 1 contains classic housekeeping genes, such as GAPDH (Barber ) and ribosomal protein genes (RP-) (Silver ); basis 3 contains well-known tumor-related genes, including EGFR (Blakely ) and CDK4 (O’Leary ). In particular, the cells of the HCC827 cell line (one of the three cell types) have overall high scores in basis 3 (Supplementary Fig. S9), a reasonable result because the HCC827 cell line contains an EGFR activating mutation (Della Corte ). In summary, scPNMF step I outputs bases representing sparse and functionally interpretable gene sets.

3.2 Basis selection is an essential step in scPNMF

Here, we explain why basis selection is an essential step in scPNMF. In the last section, we show that each PNMF basis of the FregGold dataset approximately represents one functional gene group. It is well known that housekeeping genes (basis 1) and cell-cycle genes (basis 4) are usually irrelevant to cell-type distinctions. However, such biological knowledge is not always available or certain. Therefore, scPNMF mainly relies on the two data-driven strategies: correlations with cell library sizes and multimodality tests (Section 2.2.2) for selecting informative bases. Figure 2c visualizes the two strategies: cell scores in bases 1 and 4 are highly correlated with cell library sizes (Pearson correlations > 0.9); cell scores in bases 2 and 3 show strong evidence as multimodally distributed (adjusted P-value < 0.05). Hence, Strategy 1 will not retain bases 1 and 4, and Strategy 2 will not retain bases 1, 4 and 5; together, bases 1 and 4 will be removed, and bases 2, 3 and 5 will be selected. To verify the effectiveness of basis selection, we use UMAP to visualize cells based on the top 50 high weight genes in the unselected bases 1 and 4 versus those in the selected bases 2, 3 and 5 (Fig. 2d). We observe that the top genes in the unselected bases completely fail to separate the three cell types, while the top genes in the selected bases perfectly distinguish the three cell types. This result strongly supports that basis selection is a necessary step of scPNMF. If cell-type labels are provided, users may use a strategy alternative to ‘correlations with cell library sizes’ by regressing out the cell library sizes in a cell-type-specific manner from every basis (see Supplementary Section S6).

3.3 scPNMF outperforms state-of-the-art gene selection methods on diverse scRNA-seq datasets

In this section, we demonstrate scPNMF’s capacity for informative gene selection. We comprehensively benchmark scPNMF against 11 other single-cell informative gene selection methods (Supplementary Table S3) on seven scRNA-seq datasets (Supplementary Table S4) using three clustering methods (Louvain clustering, K-means clustering and hierarchical clustering). For fair benchmarking, the seven scRNA-seq datasets cover both unique molecule identifier (UMI) and non-UMI protocols and include various biological samples. Using the adjusted Rand index (ARI) as the metric of clustering accuracy, we calculate the ARI values of the three clustering methods on each dataset using 100 informative genes selected by each gene selection method, as 100 genes are commonly used in targeted gene profiling. Figure 3a shows that scPNMF has overall the highest ARI values across datasets and clustering methods. In particular, scPNMF has the highest average ARI value with each clustering method (Louvain: 0.83; K-means: 0.74; hierarchical clustering: 0.69) and the highest overall average ARI (0.75) across datasets and clustering methods. Note that the mean of the overall average ARI values of all methods except scPNMF is only 0.66.

Fig. 3

Benchmarking scPNMF against 11 informative gene selection methods on seven scRNA-seq datasets. (a) Clustering accuracies (ARI values) of three clustering methods based on the informative genes selected. Gene selection methods are ordered from left to right by their average ARI across the three clustering methods and the seven datasets. (b) UMAP visualization of cells in the Zheng4 dataset based on 100 informative genes selected by each method. Genes selected by scPNMF lead to a clear separation between naive cytotoxic T cells and regulatory T cells, while the genes selected by others methods do not. We further show the UMAP visualization of cells in the Zheng4 dataset based on the informative genes selected by each of the 12 gene selection methods (Fig. 3b). Only scPNMF leads to a clear separation of naive cytotoxic T cells and regulatory T cells, while the informative genes selected by other methods except corFS and irlbaPcaFS cannot distinguish the two cell types at all. We also compare the 12 methods under a varying number of informative genes: 20, 50, 200 and 500, the commonly used gene numbers in targeted gene profiling. We observe that the overall average ARI values of scPNMF are consistently higher than those of other methods, across all informative gene numbers (Supplementary Fig. S11). We apply the same benchmarking framework to scPNMF and its variant, where PNMF is replaced by NMF, and find that scPNMF performs consistently better (Supplementary Section S7). Moreover, compared with other methods, scPNMF leads to more stable overall average ARI values under varying numbers of informative genes, indicating its stronger robustness to the gene number constraint of targeted gene profiling. These results strongly support the superior performance of scPNMF as an informative gene selection method.

3.4 scPNMF guides targeted gene profiling experimental design and cell-type prediction

In this section, we demonstrate how scPNMF can guide the selection of genes to be measured in a targeted gene profiling experiment, and how scPNMF enables subsequent cell-type annotation on the targeted gene profiling data. We design two case studies with paired scRNA-seq reference data and ‘pseudo’ targeted gene profiling data, whose per-cell sequencing depth is higher than that of the corresponding scRNA-seq data. In the first case study, we use the Zheng8 dataset (measured by the 10x Genomics protocol) as the reference dataset. To generate the pseudo targeted gene profiling data, we use a new single-cell gene expression simulator that captures gene correlations, scDesign2 (Sun ), to generate data with a 100-time higher per-cell sequencing depth. In the second case study, we use the PBMC10x dataset (measured by 10x Genomics protocol) as the reference dataset, and we use PBMCSmartseq (measured by Smart-Seq2) as the pseudo targeted gene profiling data because Smart-Seq2 has a higher per-gene sequencing depth than 10x Genomics does. In both case studies, for each gene selection method, the corresponding pseudo targeted gene profiling datasets only contain the M informative genes selected by the method. We benchmark scPNMF against the 11 gene selection methods in terms of cell-type prediction on the pseudo targeted gene profiling data. To avoid the bias for a specific classification algorithm, we apply three popular algorithms for cell-type prediction: random forest (Breiman, 2001), k-nearest neighbors (KNN) (Boser ) and support vector machine (SVM) (Boser ). In each case study, we first train each classification algorithm on the low-dimensional embeddings of the reference cells given the M = 100 informative genes selected by each gene selection method. Then, we apply the trained classifier to the low-dimensional embeddings of the cells in the pseudo targeted gene profiling data . Table 2 shows that scPNMF leads to the highest average prediction accuracy (0.81) across six combinations (two case studies × three classification algorithms). Moreover, scPNMF achieves the highest accuracy in each combination except Zheng8 + random forest where it is the second best. These results confirm that scPNMF effectively guides the selection of genes to measure in targeted gene profiling experiments, and it enables accurate cell-type annotation on newly generated targeted gene profiling datasets.

Table 2.

Prediction accuracy of cell types based on 100 informative genes selected by 12 gene selection methods in the two case studies with paired reference scRNA-seq data and targeted gene profiling data

Method	Zheng8			PBMC			Average Accuracy
	RandomForest	KNN	SVM	RandomForest	KNN	SVM
scPNMF	0.85 (0.83,0.87)	0.80 (0.78,0.83)	0.87 (0.85,0.89)	0.84 (0.79,0.88)	0.84 (0.79,0.88)	0.67 (0.61,0.73)	0.81
M3Drop	0.85 (0.83,0.87)	0.80 (0.77,0.83)	0.87 (0.84,0.89)	0.84 (0.79,0.88)	0.77 (0.71,0.82)	0.63 (0.57,0.69)	0.79
SeuratDISP	0.84 (0.81,0.86)	0.78 (0.75,0.81)	0.86 (0.84,0.88)	0.80 (0.75,0.84)	0.75 (0.70,0.80)	0.64 (0.58,0.70)	0.78
corFS	0.80 (0.77,0.82)	0.75 (0.73,0.78)	0.82 (0.80,0.85)	0.82 (0.77,0.86)	0.81 (0.76,0.86)	0.62 (0.56,0.68)	0.77
GiniClust	0.86 (0.83,0.88)	0.79 (0.76,0.81)	0.86 (0.83,0.88)	0.80 (0.75,0.84)	0.76 (0.71,0.81)	0.53 (0.47,0.60)	0.75
scran	0.79 (0.76,0.81)	0.72 (0.69,0.75)	0.82 (0.80,0.85)	0.78 (0.72,0.82)	0.73 (0.67,0.78)	0.67 (0.61,0.72)	0.75
SeuratMVP	0.83 (0.81,0.85)	0.77 (0.74,0.80)	0.85 (0.82,0.87)	0.82 (0.77,0.86)	0.74 (0.69,0.79)	0.47 (0.40,0.53)	0.74
Scanpy	0.79 (0.77,0.82)	0.71 (0.68,0.74)	0.80 (0.78,0.83)	0.80 (0.75,0.84)	0.76 (0.71,0.81)	0.52 (0.46,0.58)	0.73
SCMarker	0.77 (0.74,0.79)	0.68 (0.65,0.71)	0.74 (0.71,0.77)	0.77 (0.71,0.81)	0.71 (0.65,0.76)	0.45 (0.39,0.52)	0.69
SeuratVST	0.73 (0.70,0.76)	0.68 (0.65,0.71)	0.75 (0.73,0.78)	0.74 (0.68,0.79)	0.68 (0.63,0.74)	0.40 (0.34,0.46)	0.67
DANB	0.71 (0.68,0.73)	0.69 (0.66,0.71)	0.75 (0.73,0.78)	0.73 (0.67,0.78)	0.74 (0.68,0.79)	0.28 (0.23,0.34)	0.65
irlbaPcaFS	0.68 (0.65,0.71)	0.61 (0.58,0.64)	0.71 (0.68,0.74)	0.71 (0.65,0.76)	0.77 (0.71,0.82)	0.16 (0.12,0.21)	0.61

Parentheses are 95% confidence intervals. Highest number within each column is labeled by boldface and underline.

Prediction accuracy of cell types based on 100 informative genes selected by 12 gene selection methods in the two case studies with paired reference scRNA-seq data and targeted gene profiling data Parentheses are 95% confidence intervals. Highest number within each column is labeled by boldface and underline.

4 Discussion

We propose scPNMF, an unsupervised gene selection and data projection method for scRNA-seq data. The major goal of scPNMF is to select a fixed number of informative genes to distinguish cell types and guide gene selection for targeted gene profiling experiments. Moreover, scPNMF can project a new targeted gene profiling dataset with the selected genes to the low-dimensional space that embeds a reference scRNA-seq dataset. We perform a comprehensive benchmark to evaluate scPNMF in terms of informative gene selection against the state-of-the-art gene selection methods. Our results show that scPNMF consistently outperforms 11 existing methods for a wide range of informative gene numbers (from 20 to 500) on diverse scRNA-seq datasets. We also demonstrate that the informative genes selected by scPNMF can effectively guide gene selection for targeted gene profiling and lead to accurate cell-type annotation on targeted gene profiling data based on reference scRNA-seq data. In addition to the 11 methods, we compare scPNMF to the factorial single-cell latent variable model (f-scLVM) (Buettner ), both conceptually and empirically, to clarify their differences and further illustrate the unique strength of scPNMF (see Supplementary Section S8). Besides gene selection and data projection, scPNMF also works as a dimensionality reduction method with good interpretability. Each dimension in the low-dimensional space found by scPNMF can be considered as a new functional ‘feature’ (as a linear combination of correlated and thus functionally related genes). Moreover, the mutual exclusiveness makes the PNMF bases used in scPNMF advantageous over the PCA bases in terms of removing confounding effects. For example, cell-cycle genes obscure the identification of cell types and should be removed from low-dimensional embeddings of cells. For PCA, cell-cycle genes affect many PCA bases, so the popular scRNA-seq pipeline Seurat implements a complicated approach that first calculates ‘cell-cycle scores’ and then regresses each basis (principal component) on these scores to remove the effects of cell-cycle genes (Stuart ). In contrast, cell-cycle genes are concentrated in only one PNMF basis, so it is easy to remove that basis to clear the effects of cell-cycle genes. Therefore, scPNMF has great potentials in deciphering cell heterogeneity in single-cell data by working as an interpretable dimensionality reduction method. The current implementation of scPNMF focuses on single-cell gene expression data. Considering the rapid development of single-cell multiomics technologies, we plan to extend scPNMF to accommodate other technologies that measure other genomics features such chromatin accessibility landscapes measured by single-cell ATAC-seq (Pott and Lieb, 2015), or even to integrate data across multiomics datasets. Another note is that the multimodality test for basis selection in scPNMF only accounts for discrete cell types but not continuous cell trajectories. Therefore, other tests or strategies are needed to select informative bases to capture biological variations along continuous cell trajectories. An important question for gene selection is: how many genes should be selected as informative genes to fully capture the biological variations of interest? In our studies, we observe that, after the informative gene number reaches 200, the clustering accuracies based on the selected informative genes plateau for most gene selection methods including scPNMF. Therefore, 200 genes may be sufficient for capturing biological variations in scRNA-seq data. However, it remains challenging to decide the minimum number of informative genes, given that the underlying cell subpopulation structure is data-specific and might be complex. We plan to explore this problem in the future with the possible use of information theory. Financial Support: This work was supported by the following grants: National Science Foundation DBI-1846216, NIH/NIGMS R01GM120507, Johnson & Johnson WiSTEM2D Award, Sloan Research Fellowship, and UCLA David Geffen School of Medicine W.M. Keck Foundation Junior Faculty Award (to J.J.L); NIH/NINDS R01NS117148 (to R.W). Conflict of Interest: none declared. Click here for additional data file.

39 in total

1. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

Authors: Evan Z Macosko; Anindita Basu; Rahul Satija; James Nemesh; Karthik Shekhar; Melissa Goldman; Itay Tirosh; Allison R Bialas; Nolan Kamitaki; Emily M Martersteck; John J Trombetta; David A Weitz; Joshua R Sanes; Alex K Shalek; Aviv Regev; Steven A McCarroll
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

2. Probabilistic count matrix factorization for single cell expression data analysis.

Authors: Ghislain Durif; Laurent Modolo; Jeff E Mold; Sophie Lambert-Lacroix; Franck Picard
Journal: Bioinformatics Date: 2019-10-15 Impact factor: 6.937

3. Single-cell multimodal omics: the power of many.

Authors: Chenxu Zhu; Sebastian Preissl; Bing Ren
Journal: Nat Methods Date: 2020-01 Impact factor: 28.547

4. GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues.

Authors: Robert D Barber; Dan W Harmer; Robert A Coleman; Brian J Clark
Journal: Physiol Genomics Date: 2005-03-15 Impact factor: 3.107

5. scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured.

Authors: Tianyi Sun; Dongyuan Song; Wei Vivian Li; Jingyi Jessica Li
Journal: Genome Biol Date: 2021-05-25 Impact factor: 13.583

6. Imaging individual mRNA molecules using multiple singly labeled probes.

Authors: Arjun Raj; Patrick van den Bogaard; Scott A Rifkin; Alexander van Oudenaarden; Sanjay Tyagi
Journal: Nat Methods Date: 2008-09-21 Impact factor: 28.547

7. Fast, sensitive and accurate integration of single-cell data with Harmony.

Authors: Ilya Korsunsky; Nghia Millard; Jean Fan; Kamil Slowikowski; Fan Zhang; Kevin Wei; Yuriy Baglaenko; Michael Brenner; Po-Ru Loh; Soumya Raychaudhuri
Journal: Nat Methods Date: 2019-11-18 Impact factor: 28.547

8. Selection of housekeeping genes for gene expression studies in human reticulocytes using real-time PCR.

Authors: Nicholas Silver; Steve Best; Jie Jiang; Swee Lay Thein
Journal: BMC Mol Biol Date: 2006-10-06 Impact factor: 2.946

9. NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data.

Authors: Yuting Ye; Jingyi Jessica Li
Journal: BMC Genomics Date: 2016-01-11 Impact factor: 3.969

10. Evolution and clinical impact of co-occurring genetic alterations in advanced-stage EGFR-mutant lung cancers.

Authors: Collin M Blakely; Thomas B K Watkins; Wei Wu; Beatrice Gini; Jacob J Chabon; Caroline E McCoach; Nicholas McGranahan; Gareth A Wilson; Nicolai J Birkbak; Victor R Olivas; Julia Rotow; Ashley Maynard; Victoria Wang; Matthew A Gubens; Kimberly C Banks; Richard B Lanman; Aleah F Caulin; John St John; Anibal R Cordero; Petros Giannikopoulos; Andrew D Simmons; Philip C Mack; David R Gandara; Hatim Husain; Robert C Doebele; Jonathan W Riess; Maximilian Diehn; Charles Swanton; Trever G Bivona
Journal: Nat Genet Date: 2017-11-06 Impact factor: 38.330

2 in total

1. geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNA-seq.

Authors: Alsu Missarova; Jaison Jain; Andrew Butler; Shila Ghazanfar; Tim Stuart; Maigan Brusko; Clive Wasserfall; Harry Nick; Todd Brusko; Mark Atkinson; Rahul Satija; John C Marioni
Journal: Genome Biol Date: 2021-12-06 Impact factor: 13.583

2. Genomic and transcriptomic profiling reveals distinct molecular subsets associated with outcomes in mantle cell lymphoma.

Authors: Shuhua Yi; Yuting Yan; Meiling Jin; Supriyo Bhattacharya; Yi Wang; Yiming Wu; Lu Yang; Eva Gine; Guillem Clot; Lu Chen; Ying Yu; Dehui Zou; Jun Wang; An T Phan; Rui Cui; Fei Li; Qi Sun; Qiongli Zhai; Tingyu Wang; Zhen Yu; Lanting Liu; Wei Liu; Rui Lyv; Weiwei Sui; Wenyang Huang; Wenjie Xiong; Huijun Wang; Chengwen Li; Zhijian Xiao; Mu Hao; Jianxiang Wang; Tao Cheng; Silvia Bea; Alex F Herrera; Alexey Danilov; Elias Campo; Vu N Ngo; Lugui Qiu; Lili Wang
Journal: J Clin Invest Date: 2022-02-01 Impact factor: 19.456

2 in total