Literature DB >> 34141142

scDA: Single cell discriminant analysis for single-cell RNA sequencing data.

Qianqian Shi¹, Xinxing Li¹, Qirui Peng¹, Chuanchao Zhang², Luonan Chen^2,3,4.

Abstract

Single-cell RNA-sequencing (scRNA-seq) techniques provide unprecedented opportunities to investigate phenotypic and molecular heterogeneity in complex biological systems. However, profiling massive amounts of cells brings great computational challenges to accurately and efficiently characterize diverse cell populations. Single cell discriminant analysis (scDA) solves this problem by simultaneously identifying cell groups and discriminant metagenes based on the construction of cell-by-cell representation graph, and then using them to annotate unlabeled cells in data. We demonstrate scDA is effective to determine cell types, revealing the overall variabilities between cells from eleven data sets. scDA also outperforms several state-of-the-art methods when inferring the labels of new samples. In particular, we found scDA less sensitive to drop-out events and capable to label a mass of cells within or across datasets after learning even from a small set of data. The scDA approach offers a new way to efficiently analyze scRNA-seq profiles of large size or from different batches. scDA was implemented and freely available at https://github.com/ZCCQQWork/scDA.

Entities: Chemical Disease Gene Species

Keywords: Cell annotation; Cell-by-cell representation graph; Discriminant analysis; Discriminant features; Single-cell RNA-sequencing

Year: 2021 PMID： 34141142 PMCID： PMC8187165 DOI： 10.1016/j.csbj.2021.05.046

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 7.271

Introduction

Single-cell RNA sequencing profiles gene expression of individual cells, allowing to detailly characterize multicellular organisms than bulk RNA-seq [1], [2], [3]. While, given a scRNA-seq transcriptomic data set, one challenge is to unsupervisedly find out distinguishing features (or genes) for different cell populations [4], [5]. In particular, reliable features would greatly improve efficiency to annotate large data sets or newly profiled cells [5], [6]. Recently, a popular way is a two-step schema [7], i.e., firstly identifying cell groups based on clustering approaches, then obtaining discriminant genes between these cell groups. These genes are usually referred to as marker genes [5]. For instance, the pipeline of SC3 [8] is to first infer cell clusters, and then conduct binary comparisons (i.e., one cluster vs the remaining cells) for selecting cluster-specific markers. SparseDC [9] model can extract diverse types of marker genes with known cell labels based on two-sample statistical theory. Unfortunately, this schema roughly defines a stepwise solution, which ignores the inherent correlations between genes and cells. Both results may be overfull dependent on cell clustering and easily lead to biased analysis. Meanwhile, group-level comparisons construct a separating line or hyper-plane in the original data but are difficult to capture the crucial information underlying heterogeneous structures. This could largely depress their performances to discriminate cells from new data sets. Such issues thus turn the attention to those feature extraction methods, which enable to discover a set of informative genes (or metagenes) by preserving the supportive structure inherent in data. Actually, many models are ever developed and widely used for analogous problems in bulk RNA-seq [10], [11]. For example, principal component analysis (PCA), is directly borrowed to analyze scRNA-seq data sets [12], or modified to handle noisy data (e.g., dropouts) in experiments [13]. Considering high heterogeneity among cells, some nonlinear methods, such as t-SNE [14] and SNN-Cliq [15], were also proposed. They aim to preserve local variations of data in a lower dimensional feature space and help improve the visualization of natural cell groups, appearing more viable for scRNA-seq profiles. However, there are two main shortcomings in these involved approaches: (i) they only retain part of the data structures in new feature space, e.g., global linear structure (e.g., PCA) or neighborhood relationship (e.g., t-SNE), which cannot take advantage of the full population heterogeneity; and (ii) consequently the obtained features are usually ineffective for annotating new samples. In most cases, these limitations further require costly re-analysis when handling merged data sets, or may take more time and memories with large amount of cells (e.g., t-SNE). To address such challenges in real applications, we propose a novel model, which can unsupervisedly capture the overall relationship of cells and meanwhile obtain powerful discriminants (metagenes) from scRNA-seq data sets. In particular, the cell-by-cell relationship can be expressed as representations of multiple neighboring cells [16], [17], which provides information on the gross (global and local) variabilities among the inherent cell types (Supplementary note 1). The representation graph in our method is more robust to noise and data heterogeneity in profiles [16], rather than the pair-wise distances. Moreover, the metagenes are also identified to best fit in with the full representation configuration. These features carry optimal discriminative information, and can be used to discriminate unlabeled cells from new dataset. With the model, we made the analytic pipeline of single cell discriminant analysis (scDA) for scRNA-seq data (Fig. 1). We separate the whole data set into two parts: discovery (or old) set and validation (or new) set. On this basis, scDA implements two main steps: identify data structures (cell quantitative affinities and cell clusters) and discriminant features (metagenes) with the discovery set, and then use them to label the cells in the validation set. Thus, scDA can avoid unnecessary re-clustering, and is actually a combinational approach simultaneously performing both clustering and classification. We demonstrate the effectiveness and efficiency of our scDA on eleven scRNA-seq data sets involving different biological systems and technologies.

Fig. 1

Illustrative scheme of scDA. (a) Identify cell affinities and discriminants from profiles of discovery cohort. Matrix Z and P respectively denote representation matrix and discrimination matrix, which are solved through the formula using expression matrix of X. The representation matrix can be converted to similarity matrix, then used to define cell clusters in discovery data set. (b) Classify new samples from validation cohort (i.e., X*) based on the obtained cell clusters and discriminants. Red stars highlight scDA implemented in two available ways: (i) unsupervised mode only with scRNA-seq dataset; (ii) supervised mode with data and given labels of cells in training set. In supervised mode, the prior cell annotations can help constrain scDA model either to obtain features (i.e., representation and discrimination matrices) or build classifiers (i.e., with the discriminant vectors and cell annotations). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Materials and methods

Processing of scRNA-seq expression profiles

Ten scRNA-seq data sets were selected for method evaluation in total (Table 1); we kept the originally normalized data from corresponding researches. All these datasets were log-transformed after adding a pseudo count of 1, and further preprocessed by different approaches for different feature measurements.

Table 1

Datasets for scDA evaluation.

Datasets	k	N	Unit	Description	Protocol	Reference
Camp	7	777	FPKM	Human liver bud	SMARTer	Ref. [33]
Goolam	5	124	CPM	Mouse embryo	Smart-Seq2	Ref. [34]
Li	9	561	FPKM	Human cell line	SMARTer	Ref. [35]
Pollen	11	301	TPM	Human cell line	SMARTer	Ref. [36]
Lawlor	6	597	RSEM	Human pancreas	Smart-Seq2	Ref. [37]
Segerstolpe	6	1,812	RPKM	Human pancreas	Smart-Seq2	Ref. [38]
Muraro	6	1,940	Count	Human pancreas	CEL-Seq2	Ref. [39]
Enge	5	2,174	Count	Human pancreas	Smart-Seq2	Ref. [40]
Baron	6	7,742	Count	Human pancreas	inDrop	Ref. [41]
Galen	15	4,677	CPM	Human bone marrow aspirate	Seq-Well	Ref. [42]
1 M	20	1,306,127	Count	Mouse brain	10X Genomics Chromium	–

k: number of groups provided by the authors.

N: number of cells.

Datasets for scDA evaluation. k: number of groups provided by the authors. N: number of cells. With the four datasets from Camp to Pollen, we considered the effect of dropouts on clustering and classification, and generated series of corresponding data subsets with different numbers of genes. Since the sparsity rate is empirical and uncertain, we simply filtered out those genes when a gene has more than a certain proportion of zeros (i.e., 0.1 ∼ 0.9 by 0.1) across cells. Thus, these benchmarks were expanded to data sets. With pancreas datasets, we tested clustering and classification performances of scDA within and across datasets. For the first purpose, we generated two large datasets, namely Large1 and Large2. Large1 is a mix of 4 profiles including Lawlor, Segerstolpe, Muraro and Enge. Large2 is Baron. The raw data sets were firstly preprocessed, including quality control and normalization. Then Large1 sets are further batch corrected. All the process steps are similar as Haghverdi’s work[18] except that we use all the genes rather than highly variable genes when removing batch effects of Large1. We abandoned these ‘Unknown’ cells and retained the other six types of cells in all. We applied scDA to Large1 under sparsity rate of 0.3 and Large2 under 0.1 to make a similar number of genes ∼ 10,000. The specialized experiments can be used to test the robustness of scDA as well as to evaluate intra-dataset classification performances. For the evaluation of inter-dataset classification performances, we keep all the common (i.e., 15,446) genes across the five pancreas datasets for fair comparison instead. We also used the Galen data for classification evaluation across different subjects. The whole amount of genes, i.e., 27,672, passing the filtering thresholds in the original paper are used in our work. 1 M dataset (obtained from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons) is mainly used to test time efficiency and computing capacities of scDA, where top 2000 highly variable genes are remained for simplicity.

Method overview

Given the processed scRNA-seq data (e.g., from discovery set), scDA could construct two matrices, defined as the ‘representation matrix’ and ‘discrimination matrix’ respectively, to accomplish two tasks: characterizing cell types, i.e. clustering in discovery set (Fig. 1a), and predicting new cells in unannotated set (Fig. 1b) based on the model-determined or prior-given labels, i.e., identified in unsupervised and supervised modes (details in 2.3, 2.5). The ‘representation matrix’ describes the inherent relationship of cells, available for cell clustering, and the ‘discrimination matrix’ contains representation-learning-based features, qualified for classification. In fact, the key issue of scDA approach is to solve the two matrices that involves three basic assumptions: (i) different cell populations belong to distinct biological subspaces or processes, (ii) each cell can be mathematically represented by other cells with similar biological properties, and (iii) the intrinsic heterogeneity as described in (i)-(ii) can be approximately reconstructed on a common feature subspace. Under assumptions (i) and (ii), which cover global differences and local similarities of underlying data, the cell-by-cell representation graph can be built based on expression profiles by subspace segmentation theory [17], [19]. It presents as a sparsely filled matrix, where the coefficients are equal to zero if cells are faraway or in different clusters (Supplementary note 1), and reveals implicit subspaces of various cell populations. Note that representation-based measurement not only describes cell affinities as well as pairwise metrics, but also has robust performances against data biases or noises from scRNA-seq experiments (Fig. 2 and Supplementary Figs. S1–S4). Broadly, the graph quantitatively defines intrinsic heterogeneity from cells, which is of great importance to obtain discriminant features.

Fig. 2

Comparative performances of scDA approach to existing methods. (a, b) Adjusted Rand index (ARI) values of clustering (a) and classification (b) for small data subsets in scDA supervised mode at different sparsity rates, e.g., from 0.8 to 0.2 by 0.2. Performances at other sparsity rates are seen in Supplementary Figs. 1 and 2. Clustering and classification are respectively abbreviated to clu and cla, which are used as footnotes in Fig. 2, Fig. 4 and Supplementary Figs. 1–3, 5, 9, 10, 13. Bars with zero height indicate NA values. Error bars in (b) indicate upper 95% confidence interval of cross-validation results.

Fig. 4

Application of scDA on two large data sets. (a) Clustering (for discovery cohort) and classification (for validation cohort) performances in unsupervised manner with trained datasets of different sizes. For each data set, random sampling was repeated 100 times to construct the discovery sets at certain sampling rates, depicting the percentage of cell amount in total. The validation set is the rest dataset with removal of discovery set. (b) ARI difference (i.e., mean s.e.) between clustering and classification performances from Fig. 4a. (c) Classification performance with construction of classifiers using reference labels. The dashed gray line indicates ARI of 0.9. (d) Enrichment results of marker genes with discriminant vectors from Large1 pancreas dataset. Marker genes in Fi. 4 and Supplementary Fig. S12 are obtained from CellMarker database [28]. The top 5% dimensions (or discriminant vectors) of discrimination matrix is used for enrichment analysis. Discriminant vectors were ascendingly sorted according to the eigen values. The visual example is selected owing to the highest clustering accuracy at sampling rate of 8% with Large1 dataset (k = 40). Each dot indicates enrichment significance.

Characteristics of representation and discrimination matrices from scDA with small data sets. (a, b) Illustration of representation matrices (a) and projected data (b) from Fig. 2a. Column-side colors indicate the reference cell types provided by the original authors. Sample orders are the same in (a) and (b). The optimal groups estimated by the largest eigen gap are separated by black vertical lines in (b). The top discriminants (eigen vectors) occupying 5% of the total number of involved cells, were selected and marked according to eigen values. (c, d) Discriminative abilities of the example discriminants to separate model-identified cell populations between the most distant pairs (c) or neighboring groups (d). DDS, discrimination score for distant groups; NDS, discrimination score for neighboring groups (see Methods). Discrimination scores smaller than 0 are not shown. Fitness curves were created under ‘loess’ regression. P-value was calculated using Wilcoxon test. Application of scDA on two large data sets. (a) Clustering (for discovery cohort) and classification (for validation cohort) performances in unsupervised manner with trained datasets of different sizes. For each data set, random sampling was repeated 100 times to construct the discovery sets at certain sampling rates, depicting the percentage of cell amount in total. The validation set is the rest dataset with removal of discovery set. (b) ARI difference (i.e., mean s.e.) between clustering and classification performances from Fig. 4a. (c) Classification performance with construction of classifiers using reference labels. The dashed gray line indicates ARI of 0.9. (d) Enrichment results of marker genes with discriminant vectors from Large1 pancreas dataset. Marker genes in Fi. 4 and Supplementary Fig. S12 are obtained from CellMarker database [28]. The top 5% dimensions (or discriminant vectors) of discrimination matrix is used for enrichment analysis. Discriminant vectors were ascendingly sorted according to the eigen values. The visual example is selected owing to the highest clustering accuracy at sampling rate of 8% with Large1 dataset (k = 40). Each dot indicates enrichment significance. While, it is usually considered that the number of features inherent in data is much smaller than the total number of profiled genes [5], [18]. As assumed in (iii), our dimensionality reduction is constrained by the obtained representation structure (Supplementary note 2). Based on metric-learning theory [20], this approach aims to project the original data onto a subspace which best fits in with the detailed configuration of within- and between- groups [21]. Therefore, the projection matrix contains the information of underlying cell differences and has more discriminating power on new objects than PCA or t-SNE. However, it is generally impossible to simultaneously obtain the optimal solutions of the two matrices. Inspired by the phenotypic and feature (metagene) correlations, here we design an alternately iterative optimization algorithm, which is guaranteed to converge theoretically [17], [19]. Our method is robust to a variety of the hyperparameter settings (Supplementary Figs. 5 and 6). We emphasize that the whole solving process forms a data-driven closed loop to alternately compute the cell-by-cell and feature-by-gene matrices until it reaches convergence. The loop ensures scDA could unbiasedly derive inherent information from (discovery) data sets. Then, with the optimal representation matrix, scDA is capable to estimate the involved cell types through a graph-based clustering method, e.g., spectral clustering [22]; and classify the unlabeled cells to the acquired assignments based on discriminant vectors, e.g., KNN classifier. The hybrid approach dispenses with re-analysis, appearing more feasible to deal with large amount of cells within or across datasets. We also provide supervised mode to accommodate valuable prior-knowledge from experiencers to obtain more reliable annotations. Regardless of running modes, the performance of scDA is considered to be closely related to the characteristics of these two matrices.

Optimization problem of scDA

Suppose we describe a scRNA-seq data with g biological measurements and N samples as a matrix . scDA could learn the representation matrix and the discrimination matrix from the input data matrix by solving the following objective function:where is an matrix containing all the coefficient vectors of . represents the nuclear norm of the matrix, which is the sum of singular values. The projection matrix denotes the d discrimination vectors transformed from all the profiled genes and -norm is used to measure the reconstructive error matrix due to its robustness [19]. In the constraint conditions, denotes the identity matrix, indicating is an orthogonal subspace without redundancy. Here 1 is an all-one vector for normalization. And is the complement of , where is a set of edges between the samples in a predefined adjacency graph. For example, if and are not neighbors, then we have. Thus, the third constraint can guarantee the preservation of local structure underlying data. In unsupervised mode, we use K nearest neighbor (KNN) algorithm with pair-wise Euclidean distances to determine the sample adjacency graph; in supervised mode, the local constraint is fully or partly determined by prior-knowledge, e.g., known cell labels. While, parameter K is actually free to set and besides it, the equation (1) have a tuning parameter to balance the two terms. They both can be selected according to data properties, or settled empirically. Next, we focus on the solution of the objective function. Considering either of the objective terms can be solved separately, we design an alternately iterative algorithm to compute one matrix by fixing the other one. The splitting problems can then be seen as follows. (1) Solving the optimal matrix by fixing Let’s convert the Eq. (1) to an equivalent form when is fixed:where presents the reconstructive error matrix. It can be solved via modified low-rank representation algorithm [17] (Supplementary Note 1). (2) Solving the optimal matrix by fixing When is given, the substantial problem becomes as: The problem (3) can be solved using -norm minimization technique [23] (Supplementary Note 2).

scDA clustering for discovery set

Given representation matrix , the corresponding affinity matrix is obtained by . With the affinity graph, we used spectral clustering algorithm (e.g., RatioCut [24]) to identify the underlying groups of cells. And the number of groups, i.e., k, is estimated in this work by the eigengap heuristic [22], or determined by other approaches, e.g., silhouette coefficients [25]. If is constrained fully by known cell labels, clustering step can be omitted, e.g. study with Galen.

scDA classification for unannotated set

For 4 small data sets, we adopted five-fold cross-validation experiments. All the cells in data are randomly separated into five folds, while each fold is treated as the testing set and the remaining folds as the training set. We use a set of discriminant vectors obtained from training set and (i) inferred labels with representation matrix or (ii) the gold-standard cell labels to build KNN classifiers, i.e., classifications in unsupervised /supervised way, for annotating testing cells. Additionally, the prediction accuracy is impacted by the number of discriminant vectors. We pick the first d eigen vectors of the discrimination matrix , as they are ordered ascendingly by the eigen values. For pancreas data sets in Results section, we have small discovery subsets, of which (i) the cell number ranges from 3% to 10% of all the cells from large datasets for classification intra-datasets, (ii) less than 10% size of Baron for classification cross-datasets, and the remaining cells as validation subsets. Under the same sizes, the paired subsets were randomly generated for 100 times. At every time, we randomly selected the same proportion of cells from all the six cell types to guarantee a full learning with training sets. Then, the classification experiments in unsupervised /supervised way are similar to the above descriptions in small data sets. For 1 M dataset, we firstly run PCA and get the 2,000 medoids with 50 PCs as discovery landmarks for clustering, and then completed classification step.

Methods comparison

SC3 (version 1.14.0), t-SNE (package Rtsne version 0.15), Seurat (version 2.3.4) and SparseDC (version 0.1.17) were also applied to the small data sets, where all four are to benchmark cell clustering, and three for classification except t-SNE. For SC3, the default parameters provided by the author are used for clustering and marker-genes identification. For t-SNE, given the existing cell types in each data set, we took the expected smallest amount of cells as its perplexity parameter, and the projected data are clustered following k-means. For Seurat, we chose the improved PCA-based clustering pipeline as representative method from several packaged dimensionality reduction models. For SparseDC, we randomly divided the cells into two groups as it requires two conditions before identifying cell types and the related marker genes. Note that for SparseDC we only selected the cell type-specific genes for classification. The number of cell clusters is inferred by the default functions for SC3 and Seurat, or directly set as original group number for SparseDC and t-SNE. The whole sets of the identified marker genes are used when built corresponding KNN classifiers for SC3, Seurat and SparseDC.

Evaluation metrics

We used two metrics to evaluate the performances of scDA, one is adjusted Rand index (ARI) and the other is discrimination score (DS). ARI is used to quantify the similarity between the reference labels and the clusters/predictions obtained by scDA and other involved methods, as the cell labels were known or ever inferred by the authors in all the applied data sets. For n objects, there are two groups denoted as a and b, and the relationship of the two groups can be summarized by a contingency table. Each entry of the table stores the number of common observations between a and b. Given the table, the ARI is calculated aswhere is the entry at the ith row and jth column in the contingency table, , are the marginal sums for the corresponding groups, and () represents a binomial coefficient. The definition of DS is inspired by the fisher’s discriminant rule [26] and a measure of group coherence, i.e., silhouette coefficient [25]. We use it to assess how effective each projection vector (feature) is to discriminate a given pair of groups, for instance, denoted as a and b. With the cell i in a, let a(i) be the median distance of cell i to all other cells within a, and b(i) denotes the median distance of cell i to all cells in b. The DS for cell i is defined as: For simplicity, we further define distant DS (DDS) to measure the separations between the most distant groups, and neighboring DS (NDS) for neighboring groups. Both the scores range from + 1 to −1. And its median value over all the cells can then be used to measure the discriminative power of the examined feature to separate different groups. With the value bigger than 0, the feature could make positive contributions to distinguish groups. While if the value is closed to 1, it suggests that on the projected axis (or direction), cells are grouped tightly within respective clusters and far from the other clusters (similar to the Fisher’s ratio of between- to within-group distance), then indicates that the projection feature brilliantly discriminates cells from the two groups.

Biological insights

We adopted GSEA approach [27] to explore the biological significance of discriminant features (metagenes). Features are sorted in a descending order according to the absolute scores of (i.e., discrimination matrix), and gene sets of certain biological functions or sources are then tested. After FDR correction, the significant gene sets, e.g., P-value , can reveal the true biological information contained in the features. For example, when we applied scDA to the large data sets, we used the marker genes of known pancreatic cell types from CellMarker database [28] to determine whether the features are interpretable by the prior knowledge. It is implemented by phenoTest package (version 1.34.0).

Results

Study on small benchmark datasets

To exploit the characteristics of representation and discrimination matrices identified by scDA, we considered four small scRNA-seq datasets with well-defined cell types and different sequencing protocols (Table 1). As the representation graph is used to identify cell groups, and with discriminant vectors together to predict labels of new cells, we simply evaluate the two types of matrices by clustering/classification accuracies. We also conducted robustness tests using series of data subsets with different sparsity rates (see Methods). Note that both matrices and clustering/classification performances are assessed by adjusted Rand index (see Methods), which scores closer to 1 when the tested labels are similar to gold-standard ones. Here, we considered four established methods: SC3, t-SNE, Seurat[29] and SparseDC for fair comparisons. In respect of clustering, scDA performed generally better than other involved methods, or pretty close to the best results (Fig. 2a and Supplementary Fig. S1). It is remarkable that scDA was the most stable method with respect to the percent of dropouts. Overall, the clustering achievements reveal that scDA can recognize the inherent data structures, and this feature is crucially important as the discriminative projection learns from such representation configuration (Fig. 1a). To classify unlabeled cells with discriminant vectors, scDA outperformed the three tested models (t-SNE not included): stable and accurate across the spectrum of input gene sets (Fig. 2b and Supplementary Figs. S2, S3). While, the second best method is SC3, which uses marker genes from binary classifiers. But the individual markers are less informative to discriminate different cell clusters with the increase of sample size; scDA showed more obvious superiority, such as on Li and Camp data sets, in both unsupervised and supervised modes. Furthermore, we compared clustering time cost of the best two algorithms and found scDA (implemented in python) also computes faster than SC3 (Supplementary Fig. S7). In addition to these simple assessments, the cell-by-cell representation matrices could provide more affinity details than plain separations (Fig. 3a). Their illustrations appear nearly block-diagonal, indicating cells are tightly grouped in the identified clusters. Such “block” structures even can be seen in individual dimensions of the discrimination space (Fig. 3b). Here we use discrimination score (DS) to evaluate how powerful each single feature can discriminate different cell types (see Methods). We also compared scDA to principal component analysis (PCA), which also yields orthogonal rotations but without restrictions of cell relations. According to the scores, these projection vectors derived from scDA are overwhelmingly more favorable for distinct cell populations (Fig. 3c, d). Together with the minor differences between clustering and classifications (Supplementary Fig. S4), all the results showed that the scDA projection space with the preservation of overall-relationship is indeed spanned by remarkable discriminant vectors.

Fig. 3

Above all, scDA model can unsupervisedly recognize the intrinsic sample patterns, and also obtain a number of powerful discriminants. These characteristics allow us to apply scDA to large scRNA-seq profiles or across datasets.

Performance within large benchmark datasets

Inspired by the performance of scDA on small data sets, we evaluated the applicability of scDA with two large scRNA-seq profiles denoted as Large1 and Large2 (see Methods). Both data sets are generated to study human pancreas, a highly heterogeneous tissue with several determined cell types (Supplementary Table S1). The difference between them is that some cell groups inherent in the mixed data set (after batch correction), i.e., Large1, present less coherent than those in the single-sourced Large2 (Supplementary Fig. S8), thereby enabling to make comparative and contrasted evaluations with datasets in different qualities. Furthermore, we expected capacity of scDA to use small subsets of cells to discriminate the majority of cells, and thus conducted random experiments varying different sample sizes of discovery set (see Methods). Here, we use clustering ARI, classification ARI and their agreement to assess the utility of scDA. Firstly, we did unsupervised clustering with training data sets, and used the inferred labels to classify the other unlabeled cells. The results show that classification performance has strong correlation with clustering qualities in both large datasets (Fig. 4a and Supplementary Figs. S9, S10). However, more misclassifications were made when model learnt data patterns from very few samples, e.g., less than 5% of the total cells. This is probably caused due to the unbalanced distribution of given cell types (Supplementary Table S1), and small sampling ratio is more likely to obtain invalid training sets. Moreover, such mistakes still may occur even if sampling ratio increased in the dataset of Large1. Using this mixed dataset, we can also see more variances of ARI difference (Fig. 4b). The unsatisfactory performances may be subject to data qualities, for Large1 is multi-sourced and some confounding effects may still exist even after batch correction. While such situations get greatly improved to obtain accuracies and consistencies very close to those outcomes in Large2 when we use more training cells, e.g., 8% but still less than 10% of the total cell amount for Large1 data (Fig. 4b). All the results suggest the robustness of scDA to data coherences. Furthermore, we then constructed the classifier with the reference cell labels instead, and also observed the classification can achieve comparably reliable outcomes between Large1 and Large2 when training the classifier with cells occupying 8% ∼ 10% of total sample size (Fig. 4c). Our discriminant vectors show fair discrimination capacities to separate the farthest or closest cell groups (Supplementary Fig. S11) and these metagenes are found highly enriched with the known cell markers (Fig. 4d and Supplementary Fig. S12). All the results indicate the discrimination matrix can capture true biological variances inherent in scRNA-seq datasets; thus, with the matrix, scDA can well predict the unlabeled cells in unsupervised (Supplementary Fig. S10) or supervised (Fig. 4c) manners. Besides, we were able to analyze a very large 10X dataset with more than 1.3 million cells and 20 clusters, generating results in good agreement with its original annotations (Supplementary Table S2). Taken together, the above analyses show excellent generalization of scDA, enabling the model to use a handful of cells to well predict the labels of abundant cells within data sets.

Classification evaluation across datasets

We next tested the prediction ability of scDA across datasets as batches or batch effects are really common between different scRNA-seq experiments [30], e.g., from different labs, sequencing protocols or subjects. An approach, which could handle cross-dataset classification, is more preferred and practical. We used the human pancreas and bone marrow aspirate datasets (Table 1) for inter-dataset evaluation, and made directional classifications where to use small datasets to predict large datasets. We firstly tested classification performance of scDA across labs/protocols with five pancreas data sets. Specially, we made extreme tests to have scDA build classifiers from cells no more than 10% of Baron, the largest of pancreas profiles, and randomly divided all other datasets into different subsets, i.e., 2 folds for Lawlor, 3 folds for Segerstolpe, Muraro and Enge. Applied to the subsets, scDA obtained good cell annotations within each dataset in unsupervised way (Supplementary Fig. S13), consistent with our findings in previous sub-section. While, we then built classifiers with identified discrimination matrix and cell gold labels, and summarized their performances in Fig. 5a. All the classifiers performed well on predicting Lawlor’s dataset due to its small amount of cells. However, the classifiers originated from Enge subsets predicted the other datasets not as well, but still acceptable (ARI), as the rest classifiers (Fig. 5a). This may be because Enge is the only dataset that doesn’t contain PP (pancreatic polypeptide) cells (Supplementary Table S2) and produces incomplete classifiers. While, from Lawlor, Segerstolpe, Muraro subsets, the classifiers were theoretically unbiasedly trained, and performed much better on across-dataset classifications, even on predicting the largest dataset, i.e., Baron, with unsupervised cell annotations (Supplementary Fig. S14). We then used principal component analysis to project our discrimination space onto three-dimensional visualization space (Fig. 5b and c), which clearly shows the “shifted gaps” across the involved datasets. These gaps explicitly reveal external differences between batches. The results indicate that the scDA discrimination space preserves biological variances underlying trained dataset, therefore it allows to label unannotated dataset of the same tissue or similar condition in regardless of batch effects.

Fig. 5

Classification performance across pancreatic datasets. (a) Heatmap of cross-dataset classification ARI values. Each dataset in discovery set is randomly divided into 2 or 3 folds to unsupervisedly get representation and discrimination matrices. Classifiers with prior-provided cell annotations are to classify the whole cells in each prediction dataset. Classification ARI values are shown explicitly. Blue striped rectangle means NA value. (b, c) Visualization of trained and predicted cells in 3-dimenstional discrimination space. Fold2 of Muraro is used to train classifier for classification of other two Muraro folds in (b); then also to predict cell labels across Lawlor, Segerstolpe, Enge and Baron data sets in (c) Each point in graph represents a cell and colored by two annotation categories: data source and cell type. Cells are colored by given labels in training fold and predicted labels in testing datasets. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) We next tested classification performance of scDA across subjects with bone marrow (BM) aspirate dataset, which contains 4 human donors named as BM1 to BM4 according to the ascending order of available cell amounts. These sets have a total of 15 cell types belonging to 4 developmental lineages (Supplementary Fig. S15). To guarantee that the training set has at least 10 cells for one cell type, we trained with BMs1–3, i.e., about 25% the size of BM4, to classify cells in BM4. We first obtained corresponding discrimination matrix to the representation matrix, which was initially constrained by KNN using given cell annotations of BMs1-3. The top 14 discriminant vectors contribute most to different types of cells (Fig. 6a), thus are then used to build the classifier for BM4. Subsequently, we compared the classified assignments with the original annotations (Fig. 6b). Around 83.0% of cells are classified correctly according to their cell types. Among the mis-classified cells (17.0%), 13.0% are assigned to their related cell types within the same lineages, or predicted between HSC/Prog and other committed progenitor cells (2.4%), early Ery and proB, GMP (1.0%). It means that our mis-classifications mainly come from predicting the intermediated states or cell types with differentiation potency along the continuum of cells. Besides, we can also observe the putative differentiation relationships from undifferentiated cells to other lineages with our discrimination space (Fig. 6c and Supplementary Fig. S16). And the expression of marker signatures also supports our classification performance across different donors (Fig. 6d).

Fig. 6

Classification performance using subjects of BMs1-3 to annotate BM4. (a) Scatter chart of discrimination scores of top 20 discriminant vectors. We select the top discriminants by: (i) before the steep drop of median DDS, and (ii) with median NDS bigger than 0. (b) Prediction comparison for 3,738 hematopoietic cells in BM4. The square of heatmap represents the ratio of cells in prediction class over all cells of the same annotation. (c) Visualization of BMs1-4 in 3-dimenstional discrimination space. Dashed lines with arrows imply committed lineages from undifferentiated to mature states, which are inferred by cell labels (Supplementary Fig. 16). (d) Expression heatmap of 55 selected cell-type-specific genes (rows) across our annotated cell classes (columns). The median expression was calculated for each cell group. HSC: Hematopoietic stem cell; Prog: Progenitor; early Ery: Early erythroid progenitor; late Ery: Late erythroid progenitor; pro B: Progenitor B cell; B: Mature B cell; Plasma: Plasma cell; T: Naïve T cell; CTL: Cytotoxic T Lymphocyte; NK: Natural Killer cell; GMP: Granulocyte-macrophage progenitor; pro Mono: Promonocyte; Mono: Monocyte; pDC: Plasmacytoid dendritic cell; cDC: Conventional dendritic cell.

Discussion

Identifying genes (or metagenes) to discriminate different cell populations is effective to annotate large amount of cells within or across scRNA-seq data sets. The challenge is that cell states are temporally and spatially heterogeneous but always unknown in biological systems. While, existing methods usually adopt (i) label-based discrimination approaches, e.g., scID [31], which needs a reference annotation to identify cluster or cell type specific genes; or (ii) unsupervised dimensionality reductions, e.g., PCA, t-SNE, to learn hidden structures from unlabeled cells. Supervised discrimination methods can be used to infer unlabeled cells in validation datasets, but limited to the biological variances and cell compositions underlying the discovery dataset. While, unsupervised methods transform genes to metagenes, generally without the ability to transfer cell labels between different datasets. In fact, neither of the two types accounts for overall affinities inherent in multiple cell populations, thereby easily resulting in misclassification on new data. We demonstrated that our scDA method is able to successfully achieve powerful discriminant features in both unsupervised and supervised manners using eleven scRNA-seq data sets. The notable advantage of scDA is that the model finds out discriminant features based on cell-by-cell representation graph. The graph quantitatively reveals both local (within cell groups) and global (across different groups) variations in data sets, and is robust to data heterogeneity. scDA features learnt from full representation configuration, accommodate affinity differences between any pairs of cells, and present more specific for cell discrimination. Moreover, scDA uses an alternately iterative solution to take advantage of the inherent correlations between samples and features. Such approach makes our model data driven and can correct the obtained information from data sets. The optimal results become reliable when considering the underlying heterogeneity across profiled cells. We demonstrated that scDA accurately predicts plentiful unlabeled cells after obtaining discriminant metagenes and estimating potential cell types from a small amount of cells, such that unnecessary re-analysis can be avoided for large studies. The differences between estimation and prediction are proven quite small, suggesting that scDA is flexible to only implement the objective of recognizing cell populations for those small or finished studies. Moreover, the separable pipeline of scDA adopts spectral clustering and KNN classification in this work, but it also allows integrations of other useful approaches for scRNA-seq data, for example, hierarchical clustering or support vector machine algorithm. Theoretically, even the representation matrix of core model can be alternatively initialized (rather than KNN graph), e.g., with prior knowledge of cell groups, which helps to obtain more interpretable outcomes. Its potential scalability would deal with different practical issues and thus be well in conjunction with other information or analytical procedures [32].

CRediT authorship contribution statement

Qianqian Shi: Conceptualization, Methodology, Investigation, Formal analysis, Writing - original draft, Writing - review & editing, Visualization, Project administration, Funding acquisition. Xinxing Li: Software, Validation. Qirui Peng: Data curation. Chuanchao Zhang: Conceptualization, Methodology, Writing - review & editing. Luonan Chen: Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

33 in total

1. Metagenes and molecular pattern discovery using matrix factorization.

Authors: Jean-Philippe Brunet; Pablo Tamayo; Todd R Golub; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2004-03-11 Impact factor: 11.205

2. Identification of cell types from single-cell transcriptomes using a novel clustering method.

Authors: Chen Xu; Zhengchang Su
Journal: Bioinformatics Date: 2015-02-11 Impact factor: 6.937

3. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure.

Authors: Maayan Baron; Adrian Veres; Samuel L Wolock; Aubrey L Faust; Renaud Gaujoux; Amedeo Vetere; Jennifer Hyoje Ryu; Bridget K Wagner; Shai S Shen-Orr; Allon M Klein; Douglas A Melton; Itai Yanai
Journal: Cell Syst Date: 2016-09-22 Impact factor: 10.304

4. Mapping the Mouse Cell Atlas by Microwell-Seq.

Authors: Xiaoping Han; Renying Wang; Yincong Zhou; Lijiang Fei; Huiyu Sun; Shujing Lai; Assieh Saadatpour; Ziming Zhou; Haide Chen; Fang Ye; Daosheng Huang; Yang Xu; Wentao Huang; Mengmeng Jiang; Xinyi Jiang; Jie Mao; Yao Chen; Chenyu Lu; Jin Xie; Qun Fang; Yibin Wang; Rui Yue; Tiefeng Li; He Huang; Stuart H Orkin; Guo-Cheng Yuan; Ming Chen; Guoji Guo
Journal: Cell Date: 2018-02-22 Impact factor: 41.582

5. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.

Authors: Nathan Lawlor; Joshy George; Mohan Bolisetty; Romy Kursawe; Lili Sun; V Sivakamasundari; Ina Kycia; Paul Robson; Michael L Stitzel
Journal: Genome Res Date: 2016-11-18 Impact factor: 9.043

6. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.

Authors: Alex A Pollen; Tomasz J Nowakowski; Joe Shuga; Xiaohui Wang; Anne A Leyrat; Jan H Lui; Nianzhen Li; Lukasz Szpankowski; Brian Fowler; Peilin Chen; Naveen Ramalingam; Gang Sun; Myo Thu; Michael Norris; Ronald Lebofsky; Dominique Toppani; Darnell W Kemp; Michael Wong; Barry Clerkson; Brittnee N Jones; Shiquan Wu; Lawrence Knutsson; Beatriz Alvarado; Jing Wang; Lesley S Weaver; Andrew P May; Robert C Jones; Marc A Unger; Arnold R Kriegstein; Jay A A West
Journal: Nat Biotechnol Date: 2014-08-03 Impact factor: 54.908

7. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.

Authors: Åsa Segerstolpe; Athanasia Palasantza; Pernilla Eliasson; Eva-Marie Andersson; Anne-Christine Andréasson; Xiaoyan Sun; Simone Picelli; Alan Sabirsh; Maryam Clausen; Magnus K Bjursell; David M Smith; Maria Kasper; Carina Ämmälä; Rickard Sandberg
Journal: Cell Metab Date: 2016-09-22 Impact factor: 27.287

8. A Single-Cell Transcriptome Atlas of the Human Pancreas.

Authors: Mauro J Muraro; Gitanjali Dharmadhikari; Dominic Grün; Nathalie Groen; Tim Dielen; Erik Jansen; Leon van Gurp; Marten A Engelse; Francoise Carlotti; Eelco J P de Koning; Alexander van Oudenaarden
Journal: Cell Syst Date: 2016-09-29 Impact factor: 10.304

9. The Human Cell Atlas.

Authors: Aviv Regev; Sarah A Teichmann; Eric S Lander; Ido Amit; Christophe Benoist; Ewan Birney; Bernd Bodenmiller; Peter Campbell; Piero Carninci; Menna Clatworthy; Hans Clevers; Bart Deplancke; Ian Dunham; James Eberwine; Roland Eils; Wolfgang Enard; Andrew Farmer; Lars Fugger; Berthold Göttgens; Nir Hacohen; Muzlifah Haniffa; Martin Hemberg; Seung Kim; Paul Klenerman; Arnold Kriegstein; Ed Lein; Sten Linnarsson; Emma Lundberg; Joakim Lundeberg; Partha Majumder; John C Marioni; Miriam Merad; Musa Mhlanga; Martijn Nawijn; Mihai Netea; Garry Nolan; Dana Pe'er; Anthony Phillipakis; Chris P Ponting; Stephen Quake; Wolf Reik; Orit Rozenblatt-Rosen; Joshua Sanes; Rahul Satija; Ton N Schumacher; Alex Shalek; Ehud Shapiro; Padmanee Sharma; Jay W Shin; Oliver Stegle; Michael Stratton; Michael J T Stubbington; Fabian J Theis; Matthias Uhlen; Alexander van Oudenaarden; Allon Wagner; Fiona Watt; Jonathan Weissman; Barbara Wold; Ramnik Xavier; Nir Yosef
Journal: Elife Date: 2017-12-05 Impact factor: 8.140

10. CellMarker: a manually curated resource of cell markers in human and mouse.

Authors: Xinxin Zhang; Yujia Lan; Jinyuan Xu; Fei Quan; Erjie Zhao; Chunyu Deng; Tao Luo; Liwen Xu; Gaoming Liao; Min Yan; Yanyan Ping; Feng Li; Aiai Shi; Jing Bai; Tingting Zhao; Xia Li; Yun Xiao
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

1 in total

1. RDAClone: Deciphering Tumor Heterozygosity through Single-Cell Genomics Data Analysis with Robust Deep Autoencoder.

Authors: Jie Xia; Lequn Wang; Guijun Zhang; Chunman Zuo; Luonan Chen
Journal: Genes (Basel) Date: 2021-11-23 Impact factor: 4.096

1 in total