Automated methods for cell type annotation on scRNA-seq data.

Giovanni Pasquini^1,2, Jesus Eduardo Rojo Arias^3, Patrick Schäfer^1, Volker Busskamp^1,2.

Abstract

The advent of single-cell sequencing started a new era of transcriptomic and genomic research, advancing our knowledge of the cellular heterogeneity and dynamics. Cell type annotation is a crucial step in analyzing single-cell RNA sequencing data, yet manual annotation is time-consuming and partially subjective. As an alternative, tools have been developed for automatic cell type identification. Different strategies have emerged to ultimately associate gene expression profiles of single cells with a cell type either by using curated marker gene databases, correlating reference expression data, or transferring labels by supervised classification. In this review, we present an overview of the available tools and the underlying approaches to perform automated cell type annotations on scRNA-seq data.

Keywords: Automatic annotation; Cell state; Cell type; scRNA-seq

Year: 2021 PMID: 33613863 PMCID: PMC7873570 DOI: 10.1016/j.csbj.2021.01.015

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Individual cells represent the basic building blocks of tissues and organisms [1]. In multicellular species, cells specialize to fulfill highly specific functions. This specialization occurs as a result of intrinsic and extrinsic cues, with spatial location and molecular profiles strongly modulating cell fate and function [2], [3]. In this context, the advent of robust and accessible single-cell sequencing technologies [4] has enormously advanced our capacity to resolve and understand the molecular mechanisms regulating cell behavior, including fate decisions, developmental transitions, and responses to injury and disease. Single-cell RNA sequencing (scRNA-seq), in particular, has revolutionized biological research, and enables the categorization of cell types across multiple species, tissues, and contexts. From a biological perspective, classifying units into groups and categories is essential for their study, and makes it possible to draw parallels between analogous units found either in different body compartments or in distinct species. However, while the human body is estimated to contain on average ~100 trillion cells, the number of distinct cell types remains unclear [5]. Moreover, to appropriately classify cells into different types, a fundamental question must be answered: what is a cell “type”? Defining cell identity is not a trivial endeavor, firstly because gene expression levels are not always binary, but can vary gradually over a spectrum. Secondly, and more importantly, transcriptional differences that would allow cells to be separated into different categories might not possess any biological relevance in terms of cellular function [6]. Considering that the human genome consists of approximately 20,000–25,000 genes, and that an average cell contains between 100,000 and 1,000,000 mRNA molecules [7], single-cell experiments require the use of amplification reactions to define the molecular profile of individual cells.
Amplification, however, introduces technical variability, which increases the level of molecular noise and imposes additional difficulties in discerning between truly relevant changes in gene expression profiles and fluctuations in transcript levels inherent to cells. The transcriptional changes that occur during the cell cycle, for instance, remain challenging to isolate from simultaneous, albeit independent, cellular processes within the cell [8]. scRNA-seq is presently the dominant approach for defining cellular states at the molecular level [9]. To date, over 19,000 studies reporting the use of scRNA-seq in a variety of tissues, organisms, and contexts have been listed in PubMed (search of term “single cell rna sequencing” on October 15th, 2020). Nonetheless, with novel sequencing methods being constantly developed [4], data standardization, curation, and integration have emerged as important challenges to be overcome for the precise and accurate categorization of cell types across species and developmental stages, as well as in injury and disease [10]. For many purposes, ensuring that a cell is significantly more similar to its in vivo counterpart than to other cell types might be sufficient [11]. Within the field of cellular engineering, profiling the transcriptional signatures of forward-programmed hiPSCs or of 3D organoids by scRNA-seq serves as a reliable quality-control measure by enabling the prompt and confident assessment of the capacity of diverse engineering strategies to drive cells into specific lineages [12], [13], [14]. For primary cells and tissues, however, the interpretation of scRNA-seq data requires caution and, when identifying novel cell types, validation by additional functional tests [15]. Starting from single-cell transcriptomes, numerous pipelines have been developed for studying cell heterogeneity [16], [17]. Manual annotation of cell types is often time-consuming and suffers from limited reproducibility.
To overcome these limitations, computational methods have recently emerged for the automated annotation of cell clusters.

Automated cell type annotation of target scRNA-seq datasets

Analysis of scRNA-seq datasets generally starts with dimensionality reduction and clustering [16], [17]. Clusters represent groups of cells with relatively similar gene expression profiles. Hence, cells clustering together are likely to possess the same identity, although diverse cellular phenomena such as cell transitions might not be fully captured in scRNA-seq datasets. Consequently, cells might be assigned erroneous identities. Furthermore, the choice of clustering methods and granularity [18] yields different cluster numbers and compositions within the same dataset. Under-clustering, in particular, can result in insufficient resolution for identifying rare cell types or transition states. Thus, defining the appropriate granularity and assigning identities to the cells in each of the clusters generated, a process known as annotation, are both crucial steps in scRNA-seq data analysis. Here, we focus on the second of these steps. A straightforward approach for cluster annotation consists of the computation of differentially expressed genes (DEGs), or unbiased markers, that define the identity of each cluster. These are subsequently overlapped with specific marker-gene lists for the cell types expected in the dataset [19]. Alternatively, unbiased markers can be used as input for statistical tests or bioinformatic analysis tools, many of them originally developed to ascribe genotype-phenotype relations in bulk RNA-Seq datasets. The most widely used of these tools include over-representation analysis (ORA) and gene set enrichment analysis (GSEA), as well as AUCell, PROGENy and DoRothEA [20], [21]. Because the task of cell type annotation is not trivial, multiple tools have been developed to automatically annotate single cells from their mRNA expression profiles. Reference cell type information is needed to label a query gene expression profile with its corresponding cell type. First, marker genes related to cell types can be easily exploited.
Lists of marker genes can be independently built by researchers or gathered from databases and ontologies. On the other hand, gene expression profiles of a reference dataset can be directly used for the annotation of a query. In particular, these tools have been designed either to annotate entire clusters or, to avoid clustering biases, to classify individual cells (reviewed in Wang et al. [22]). Moreover, important characteristics of a tool for automated cell type annotation include: the capability to assemble multiple reference datasets to smooth batch effects; the possibility to classify cell types according to a hierarchical structure, which can be given as input or learned from the data; the computation of a similarity score between reference and query, which can help identify multiple cell types being harbored by the same cell; and the ability to classify cells as “unassigned” or “unknown” when they have an identity not represented in the reference. Besides such functionalities, three main methodological approaches can be identified (Table 1). The first approach relies on information from publicly available databases and ontologies describing cell type-specific markers (Fig. 1A). A second set of methods uses labeled scRNA-seq datasets as input for cell type identification, finding the best correlation between the reference and query datasets (Fig. 1B). Finally, a number of tools use a third alternative: supervised learning, which involves the training of a classifier with a labeled reference (Fig. 1C). Thereafter, the classifier is capable of determining cell types in unlabeled datasets. These methods, and the informatic tools employing them, are discussed in further detail below, with particular focus on their strengths and limitations.
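
As a rough illustration of the marker-overlap idea described in this section, the following Python sketch scores the DEGs of one cluster against curated marker lists with a one-sided hypergeometric (over-representation) test. The gene lists, the background size of 2,000 genes, and the function names are hypothetical choices made for this example, and all marker genes are assumed to belong to the background; none of the tools in Table 1 is claimed to work exactly this way.

```python
from math import comb

def hypergeom_sf(k, background, n_markers, n_degs):
    """P(overlap >= k) when n_degs genes are drawn without replacement
    from a background that contains n_markers marker genes."""
    upper = min(n_markers, n_degs)
    total = comb(background, n_degs)
    hits = sum(
        comb(n_markers, i) * comb(background - n_markers, n_degs - i)
        for i in range(k, upper + 1)
    )
    return hits / total

def annotate_cluster(degs, marker_sets, background):
    """Return the cell type with the smallest over-representation p-value."""
    degs = set(degs)
    p_values = {}
    for cell_type, markers in marker_sets.items():
        markers = set(markers)
        overlap = len(degs & markers)
        p_values[cell_type] = hypergeom_sf(overlap, background, len(markers), len(degs))
    best = min(p_values, key=p_values.get)
    return best, p_values

# Hypothetical marker lists and cluster DEGs, for illustration only.
marker_sets = {
    "T cell": ["CD3D", "CD3E", "CD2", "IL7R"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
}
cluster_degs = ["CD3D", "CD3E", "IL7R", "LTB", "CCR7"]
best, p_values = annotate_cluster(cluster_degs, marker_sets, background=2000)
print(best, p_values)
```

In practice the p-values would need correction for the number of cell types tested, and a cluster whose best p-value remains high is probably better left unlabeled.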

Table 1

Tools for automated cell type identification.

	Marker genes	Reference dataset	Tool name	Language	Computational approach	Unassigned	Multiple reference	Hierarchical classification	Additional features	Ref
Marker gene database-based	•		scCATCH	R	Scoring system		–			[23]
	•		SCSA	Python	Scoring system		–			[24]
	•		SCINA	R	Bimodal distribution fitting to marker genes	✓	–			[25]
	•		CellAssign	R	Probabilistic Bayesian model	✓	–			[26]

Correlation-based		•	scmap-cluster	R, web app	Cosine, Spearman, Pearson	✓	✓			[27]
		•	scmap-cell	R, web app	Cosine distance based kNN	✓	✓			[27]
		•	SingleR	R	Spearman	score	✓		Harmonization of the labels allows for multiple reference datasets.	[28]
		•	CHETAH	R, Shiny app	Spearman + confidence	✓		✓		[29]
		•	scMatch	Python	Spearman, Pearson	score	✓		Cell lineage is added as lower level of classification.	[30]
		•	ClustifyR	R	Spearman, Pearson, Kendall, cosine	✓			Implements a consensus correlation score.	[31]
		•	CIPR	R, Shiny app	Dot product, Spearman, Pearson	score			Dot product implicitly involves feature selection.	[32]

Supervised classification-based		•	CaSTLe	R	XGBoost classifier	✓				[33]
		•	Moana	Python	kNN-smoothing + SVM					[34]
		•	LAmbDA	Python	Multiple ML techniques		✓		Training on multiple datasets to create a shared representation of the labels to smooth batch effects	[35]
		•	superCT	Web app	Artificial Neural Network		✓		Classifier trained on MCA with the possibility to add user defined datasets	[36]
		•	SingleCellNet	R	Random Forest	score			Similarity scores allow to find transition states and multiple identities in the same cell.	[37]
	•		Garnett	R	Elastic net regression	✓	✓	✓	Classification can be done using open chromatin information derived from scATAC-seq	[38]
		•	scPred	R	SVM	✓			Allows to train different classifiers for defined labels	[39]
		•	ACTINN	Python	Artificial Neural Network				Robust against batch effects induced by sequencing technologies	[40]
		•	OnClass	Python	kNN and Bilinear Neural Network	✓	✓	✓	Use of a CellOntology to impute labels not present in the training data.	[41]
		•	scClassify	R, Shiny app	Weighted kNN classifier	✓	✓	✓	Hierarchical cell type tree as reference. It combines six similarity matrices with five feature selection methods.	[42]

Others		•	scANVI	Python	kNN classifier					[43]
	•		Capybara	R	Quadratic programming	score			Cell engineering-oriented	[44]
		•	scID	R	Fisher’s Linear Discriminant Analysis	✓				[45]
		•	scNym	Python	Adversarial Neural Network	✓	✓			[46]

Fig 1

Approaches for cell type annotation of scRNA-seq datasets. scRNA-seq datasets can be automatically annotated by tools implementing one of three approaches: annotation by marker gene databases; correlation-based methods; and annotation by supervised classification. The task of annotating a query scRNA-seq dataset consists of assigning a cell type identity to each one of the query single cells, or to a group of cells at once i.e. an unbiasedly calculated cluster. (A) Marker gene database-based annotation takes advantage of cell type atlases. Literature- and scRNA-seq analysis-derived markers have been assembled into reference cell type hierarchies and marker lists. In this approach, basic scoring systems are used to ascribe cell types at the cluster level in the query dataset. (B) Correlation-based methods make use of multiple correlation measures to compare gene expression profiles between a reference and a query dataset, at either single-cell or cluster level, by the use of centroids (pseudo-cells obtained by averaging the single-cell gene expression level of an entire cluster). Some of these tools assemble a reference of cell type gene-expression profiles from an ensemble of published studies and bulk RNA data repositories. The annotation step in this approach consists of finding the reference cell type that best correlates to the query cell or cluster, and every tool uses multiple steps for accurately finding the best match. (C) Annotation by supervised classification uses machine learning techniques for training a classifier on reference labeled scRNA-seq datasets. The classifier is subsequently applied to the query. Supervised learning is a powerful tool for building a model distribution of training labels as a function of features. Machine learning techniques offer a variety of alternatives in the training step and allow for hierarchical classification, which permits a more biologically-relevant identification of cell types.


Cluster annotation with marker gene databases

The widespread adoption of diverse scRNA-seq platforms has driven a rapid increase in the number of transcriptomic datasets published in recent years. Thousands of scRNA-seq datasets are now publicly available, with studies aiming not only to reveal the cellular heterogeneity of diverse tissues and organisms, but also to logically and accurately classify cells (Table 2) [47], [48], [49]. To unify results and organize information about cell types and states, thousands of publications have been manually curated and available datasets have been systematically re-analyzed, with results deposited in platforms such as CellMarker [50] and PanglaoDB [51]. In CellMarker, the cataloguing of manually-curated human and mouse cell type markers has allowed 13,605 genes to be mapped to 467 human cell types, and 9148 genes to 389 mouse cell types. For these analyses, gene-expression markers were gathered from over a thousand single-cell sequencing publications retrieved by specific PubMed queries, and collected from handbooks or company databases, such as those of BD biosciences and R&D Systems. From these datasets, CellMarker categorized cell types according to their tissue of origin, then hierarchically grouped them by localization, morphology, and functionality. PanglaoDB is a cell type atlas in which information on gene expression and its relation to cell types is collected. To build PanglaoDB, an internal cell type marker database was assembled by automated abstract mining, followed by manual curation of the literature. Currently, PanglaoDB comprises 6631 marker genes mapping to 155 cell types. Similarly, CancerSEA provides markers, particularly protein-coding and long-non-coding transcripts, for 14 relevant functional cell states in cancer, including proliferative, invasive, and stemness states [52]. Altogether, these databases and online repositories offer an ample and ready-to-use source of cell type to marker gene relations derived from scRNA-seq experiments.

Table 2

Publicly available repositories and datasets used by automated annotation tools.

	Data type	Species	Info	Tissues/cell types	Ref
Human Primary Cell Atlas	Microarray	Human	Cell type profiles	Cell lines, tissues, primary cells	[53]
Blueprint	Bulk RNAseq	Human	Cell type profiles	Cell lines, tissues, primary cells	[54]
FANTOM5	Bulk RNAseq	Human, Mouse, rat, dog and chicken	Cell type profiles	15 cell types	[55]
Encode	Bulk RNAseq	Human, Mouse, Fly and Worm	Cell type profiles	Cell lines, tissues, primary cells	[56]
HCA	Single cell RNAseq	Human	Multi-organ datasets	33 organs	[49]
MCA	Single cell RNAseq	Mouse	Multi-organ dataset	98 major cell types	[48]
Tabula Muris	Single cell RNAseq	Mouse	Multi-organ datasets	20 organs and tissues	[47]
Allen Brain Atlas	Single nuclei RNAseq	Human and Mouse	Brain datasets	69 neuronal cell types	[57]
CellMarker	Marker genes	Human and Mouse	Marker Database	467 (human), 389 (mouse)	[50]
PanglaoDB	Marker genes	Human	Marker Database	155 cell types	[51]
CancerSEA	Marker genes	Human cancer	Marker Database	14 cancer functional states	[52]

Tools that have been natively developed to use the databases described previously for cell type inference include scCATCH [23] and SCSA [24]. In both cases, reference lists of markers were constructed by merging information from several sources. To use scCATCH, a tissue-specific cell taxonomy reference database known as CellMatch was assembled. Within it, markers were unified from CellMarker, the Mouse Cell Atlas project, CancerSEA, and the CD Marker Handbook. SCSA uses a collection of markers that were produced by merging CellMarker and CancerSEA. Additionally, SCSA allows users to add custom reference markers. Both scCATCH and SCSA calculate marker genes for the input clusters, and a scoring system subsequently assigns a cell type to each cluster. Additionally, SCSA provides an automatic GO term enrichment option, thereby adding information on the biological functions of the cells within each cluster. Beyond the tools above, more sophisticated statistical approaches have been used to transfer prior knowledge when trusted reference cell type markers are available, by performing a probabilistic assignment of the reference cell types. These include SCINA [25], which operates at the cluster level by fitting a bimodal distribution to marker genes; and CellAssign, which works at the single-cell level [26] using a Bayesian probabilistic model.
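
The general pattern of merging marker lists from several sources and then scoring clusters against the merged reference can be sketched in a few lines of Python. This is only a simplified illustration under stated assumptions: the weighting scheme (number of supporting sources multiplied by the log fold-change of each cluster marker), the example gene lists, and the function names are hypothetical and do not reproduce the actual scoring systems of scCATCH or SCSA.

```python
from collections import defaultdict

def merge_marker_sources(sources):
    """Union the marker lists of several sources and count how many
    sources support each (cell type, gene) pair."""
    merged = defaultdict(lambda: defaultdict(int))
    for source in sources:
        for cell_type, genes in source.items():
            for gene in set(genes):
                merged[cell_type][gene] += 1
    return {cell_type: dict(genes) for cell_type, genes in merged.items()}

def score_cluster(cluster_markers, merged):
    """cluster_markers maps gene -> log fold-change in the cluster.
    Score of a cell type = sum of (supporting sources * log fold-change)
    over the genes shared with that cell type's marker list."""
    scores = {}
    for cell_type, evidence in merged.items():
        scores[cell_type] = sum(
            evidence[gene] * lfc
            for gene, lfc in cluster_markers.items()
            if gene in evidence
        )
    best = max(scores, key=scores.get)
    return best, scores

# Two hypothetical marker sources and one hypothetical cluster.
source_a = {"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A", "MS4A1"]}
source_b = {"T cell": ["CD3D", "IL7R"], "Monocyte": ["CD14", "LYZ"]}
merged = merge_marker_sources([source_a, source_b])

cluster_markers = {"CD14": 2.5, "LYZ": 3.0, "CD3D": 0.2}
best, scores = score_cluster(cluster_markers, merged)
print(best, scores)
```

Counting the supporting sources is one simple way to let markers confirmed by several databases weigh more than markers reported only once.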

Correlation-based annotation

Correlation is the most straightforward statistical method for automatic comparison of gene expression data: it can easily use a reference dataset to reveal information about an unlabeled dataset. Moreover, correlating the expression levels of a set of genes, or of an entire transcriptomic profile, is a more refined way to find similarities between datasets than simply scoring the presence of marker genes in clusters. By combining the expression level of each gene with correlation methods, it is possible to evaluate both linear and non-linear interactions. Different strategies employing correlation have already been implemented in a variety of tools. These tools perform two main types of comparisons: either single cell-to-reference or cluster-to-reference. CIPR [32] and ClustifyR [31], for instance, employ a cluster-to-reference strategy. In particular, these tools cross-correlate unlabeled clusters to a reference of annotated clusters, with cell type labels assigned according to the best-correlating reference cell type. CIPR and ClustifyR represent clusters as centroids. Each centroid is a pseudo-cell whose expression level for each gene equals its averaged expression level in all cells of that cluster. After this step, both tools implement Spearman (default in ClustifyR) and Pearson correlation coefficients to determine the identity of each pseudo-cell (or cluster) in the query, with ClustifyR also integrating Kendall correlation and Cosine similarity, plus a consensus correlation score. In contrast to ClustifyR, CIPR recommends calculating the dot product of the logarithm-transformed fold change for each cluster, which implicitly involves feature selection. In contrast, tools such as scmap [27], SingleR [28], and scMatch [30] correlate each cell of the query dataset to a reference collection of cell types or annotated clusters. 
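
The cluster-to-reference strategy described above (centroid computation followed by rank correlation) can be sketched as follows. This is a minimal, dependency-free illustration: the gene order, the reference profiles, and the function names are hypothetical, and the code does not reproduce the implementation of CIPR or ClustifyR. It assumes that no expression vector is constant, since the correlation would then be undefined.

```python
def centroid(cells):
    """Average each gene over all cells of one cluster."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def annotate_cluster(query_cells, reference):
    """Label a query cluster with the best-correlating reference profile."""
    pseudo_cell = centroid(query_cells)
    scores = {label: spearman(pseudo_cell, profile)
              for label, profile in reference.items()}
    return max(scores, key=scores.get), scores

# Hypothetical gene order: CD3D, CD79A, CD14, LYZ, ACTB
reference = {
    "T cell":   [5, 0, 0, 1, 8],
    "B cell":   [0, 6, 0, 1, 8],
    "Monocyte": [0, 0, 5, 7, 8],
}
query_cluster = [[4, 0, 0, 0, 9], [6, 0, 1, 2, 7]]
best, scores = annotate_cluster(query_cluster, reference)
print(best, scores)
```
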
SingleR and scMatch function in a similar way, as both use a collection of bulk datasets generated from human single cell types (Table 2). In particular, SingleR uses reference expression data from Blueprint [54], Encode [56], and the Human Primary Cell Atlas [53], while scMatch also uses FANTOM5 [58] and UCSC Xena Cancer Browser (https://xenabrowser.net) data, thereby also enabling the classification of cancer-related datasets. Moreover, since no assumptions can be made on the distribution of gene expression, both tools recommend the non-parametric Spearman rank correlation. To account for potential redundancies in the collection of bulk references, both SingleR and scMatch have an initial step for finding the top correlated cell types, and subsequent steps for refining these associations. Beyond this, the annotation strategy in scMatch groups cells by cell lineage or other ontological terms at a more general level than cell type. In contrast, SingleR, which was first published using only bulk references, has recently been updated to be used with single-cell references, and now incorporates a number of novel functionalities. For instance, there is now an option for using multiple reference datasets through label harmonization. Another tool for automated annotation, scmap, offers the scmap-cluster and scmap-cell options to annotate cells either to a reference cluster or to a reference cell. Thereby, it is possible to annotate single cells without requiring the user to define clusters a priori. To achieve this, scmap-cluster computes the similarity between each cell and the centroid of each reference cluster, while scmap-cell uses a fast-approximate k-nearest-neighbor search through product quantization, with a Euclidean distance algorithm adapted to incorporate cosine distance. One requirement for the use of correlation methods, whether they map individual cells or entire clusters to a reference, is the selection of features.
Feature selection consists of identifying and removing as many irrelevant or redundant features (genes in this context) as possible from the data. Removing redundant genes is especially important when comparing datasets sequenced by different technologies, as the use of distinct sequencing parameters, e.g. variations in sequencing depth, may result in a different number of genes being detected in each cell. Of note, Kiselev and colleagues have reported that intra-dataset annotation performs poorly when unbiased gene selection methods such as highly variable genes (HVGs) or M3Drop are used instead of other feature-selection methods [59]. In their study, the best results were obtained by using random genes as features. In contrast, SingleR and ClustifyR utilize HVGs and DEGs, respectively, as default features. Feature-selection strategies were systematically tested during the development of CIPR and scmap. In CIPR, the performance of different correlation methods was evaluated when used either on all genes, or on a subset. Results suggested that methods using dot product operations on DEGs are best able to discriminate similar cell types, as they account for both down- and up-regulated genes. A distinct alternative was implemented in CHETAH [29], the only correlation-based tool implementing a method for hierarchical classification (see the next section). In CHETAH, candidate cells are compared to reference subsets in multiple rounds, with a different set of genes used to measure the similarity in each round. The best results were reported when the 200 genes with the largest absolute fold-change between a candidate cell and the averages of the sub-reference were used.
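
The fold-change-based gene selection mentioned for CHETAH can be illustrated with a short Python sketch. It assumes log-transformed expression values, so that a difference between two values approximates a log fold-change; the function names and the toy vectors are hypothetical, and the code is not taken from CHETAH itself. Only the default of 200 genes comes from the text above.

```python
def top_fold_change_genes(cell, reference_mean, n_genes=200):
    """Indices of the n_genes genes with the largest absolute difference
    between a candidate cell and the mean profile of a sub-reference.
    Inputs are assumed to be log-transformed expression values."""
    diffs = [abs(c - r) for c, r in zip(cell, reference_mean)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:n_genes])

def subset(profile, gene_indices):
    """Restrict an expression profile to the selected genes."""
    return [profile[i] for i in gene_indices]

# Toy example with four genes; only the two most discriminating are kept.
cell = [5.0, 0.1, 2.0, 3.0]
sub_reference_mean = [1.0, 0.0, 2.1, 6.0]
selected = top_fold_change_genes(cell, sub_reference_mean, n_genes=2)
print(selected, subset(cell, selected))
```

The restricted profiles would then be passed to whichever similarity measure is in use, with the selection repeated for every candidate cell and every round of comparison.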

Annotation by supervised classification

Automatic cell type annotation methods attempt to identify similarities between scRNA-seq datasets, overcoming the intrinsic noise and variability of the data. Indeed, multiple confounding factors underlie the variability found across scRNA-seq datasets. Prominent drivers of variability include the sequencing platform used, the depth of sequencing chosen for the experiment, and the method of sample preparation. Such characteristic noise and the multidimensionality of scRNA-seq data have made machine-learning methods an outstanding resource for fulfilling a variety of tasks in analysis pipelines, including dimensionality-reduction operations [60], [61]. Supervised classification, i.e. the transferring of labels from labeled to unlabeled datasets, is a classic paradigm in machine learning, for which a wide range of techniques have been developed [62]. In the field of machine learning, the term ‘supervised learning’ is used to refer to the building of a model distribution of labels (cell types) in terms of a set of features (genes) which is trained on ground truth data (a previously annotated dataset). Thereafter, trained models are used to assign labels to instances of unlabeled datasets, according to their relative features. For automatic cell type annotation in scRNA-seq datasets, tools have already been developed which use supervised classification: here we highlight the main applications for scRNA-seq datasets. Among the first classifiers for cell populations, CellNet was developed based on the Random Forest method [63]. Similarly, the same research group recently proposed a tool for single-cell classification named SingleCellNet [37]. Random Forest techniques are derived from decision trees, a class of logic-based algorithms [64], and have already proven useful in handling similarities within a scRNA-seq dataset [65]. 
By providing a quantitative score for the similarity between each cell class and each cell in the query dataset, SingleCellNet makes it possible to find multiple cell types associated with a single cell or with a group of cells: an extremely valuable functionality in the frame of cell type engineering, as cells in transition states can be identified. In addition to Random Forest techniques, the k-nearest-neighbor (kNN) instance-based learning algorithm is also used for automatic cell type classification. This method is based on the principle that, in their feature-based representation, instances of the same class localize close to each other. Thus, kNN classifies cell type by representing labeled and unlabeled instances together in the same dimensions: this assigns unlabeled instances to the most-represented class in the neighborhood. OnClass is an example of a tool that takes advantage of the power of kNN classifiers. OnClass is able to impute labels not present in the training dataset by creating a low dimensional representation of the training set [41]. In this representation, a self-implemented CellOntology enables storage of information about numerous labels, even if they are unseen in the training set. Then, novel label imputation is carried out using a bilinear neural network. Similarly, a weighted kNN classification lies at the core of scClassify [42], a tool with the capacity to derive a hierarchical reference representation from multiple datasets. Thereafter, at each node of the reference cell type tree, scClassify trains 30 classifiers obtained by combining six similarity metrics with five feature selections. Artificial neural networks (ANNs) are the basis for another class of supervised classifiers commonly referred to as perceptron-based. 
The great capacity of these techniques for solving non-linear relations between classes and features, together with advances in computation speed over recent years, has made ANN-based methods popular for tackling numerous tasks in the biomedical field [66], [67]. Examples of tools engaging ANNs for single-cell supervised classification are LAmbDA, SuperCT, and ACTINN [35], [36], [40]. LAmbDA is a framework that aims to perform multiple tasks on scRNA-seq data. ANNs perform the classification, interpreted as a transfer learning problem. By conducting the ANN training step on raw data from multiple datasets, LAmbDA creates a generalized representation of shared labels while correcting for potential batch effects. Another tool, SuperCT, was designed as a framework in which a supervised classifier is trained on all datasets within the Mouse Cell Atlas (MCA) [48], with the user able to expand this reference by submitting new datasets. In tests using the Tabula Muris Atlas as a training dataset [47], ACTINN was highly accurate in classifying strictly related cellular subtypes, and was robust against batch effects arising from the use of different sequencing techniques. As with ANNs, Support Vector Machines (SVMs) have also been used in the context of scRNA-seq data analysis. SVMs allow multicollinearity and non-linear relationships to be harnessed within scRNA-seq data. Moana [34] and scPred [39] are two examples of tools which apply SVM-based classifiers on PCA-transformed gene expression matrices. Thereby, these tools prevent single genes from having an excessive impact on cell classification. More particularly, scPred uses SVMs with radial kernels as a standard, but allows the user to train other prediction models on specific labels as well (available in the R package caret [68]). Moana engages a hierarchical classification by recursively clustering and training a classifier over multiple iterations. 
Thereafter, Moana uses kNN to smooth the expression data minimally before training an SVM with a linear kernel to classify data in clusters in the two-PC dimension space. This operation is conducted for each cluster, until all labels in the reference dataset have been separated. Using this strategy for hierarchical classification allows Moana to maximize the number of cells it can analyze while minimizing the computation time required for the training step. The hierarchical classification approach is utilized not only for its efficiency in terms of computation time, but mostly because the classification it performs resembles the structured identity of cell types in tissues more closely. In fact, hierarchies between labels can be learned from the reference data (as in CHETAH (previous section), OnClass, and scClassify) or directly given as input by the user. The latter strategy was implemented in Garnett [38], a tool that allows cell type assignment according to a tree of cell types. Garnett creates a hierarchical model from the reference dataset by using cell type markers defined by the user. On these models, the software then trains an elastic net classifier. Notably, Garnett has been adapted to also be capable of classifying cell types according to their “gene activity score” as obtained from scATAC-seq data. To harmonize cell counts between datasets while classifying unlabeled data with information from a labeled reference, semi-supervised learning techniques have also been implemented in the frame of scRNA-seq data analysis [43], [46]. Furthermore, Capybara [44] uses an unsupervised approach based on quadratic programming to score cells with a measure of cell identity which represents a linear combination of the cell types in the reference. Capybara can identify cells harboring characteristics of multiple cell types.
In Capybara, the cell type classification task is performed by a statistical framework that makes it possible to find transition states between labels in the reference. The power of scRNA-seq analysis tools lies primarily in their capacity to represent as many genes as possible in an unbiased manner. As computation tasks are presently feasible even when working with standard scRNA-seq dataset sizes, feature selection is not strictly required for supervised classification. Nonetheless, outliers and redundant features are detrimental to model training and classification in terms of computation speed and accuracy. Thus, feature selection and processing are key to enhancing the performance of supervised classification algorithms. Among the tools described in this section, many perform an initial data-processing step, while others select features. SingleCellNet, for example, selects features by pair transformation, keeping and binarizing only the most discriminant pairs before training. CaSTLe also selects features, using univariate methods such as selection by mean expression, mutual information, and correlation between genes, before splitting the data into four bins according to expression levels [33]. ACTINN performs a simple feature-cleaning step, treating as outliers those genes whose mean expression level and standard deviation lie within the highest or lowest percentile. Instead of selecting features, Moana performs an initial kNN-smoothing step to remove unwanted noise. It is important to note that one main assumption of Moana (as well as SuperCT and ACTINN) is that all cell types present in the query are also present in the reference; otherwise, unseen cell types will be assigned to the wrong labels. In contrast, other tools implement a strategy to classify cells as “unassigned” or “unknown”, frequently by defining a cut-off score that the annotation must reach to be trusted.
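The cut-off strategy can be sketched in a few lines. The example below is a generic illustration rather than the rule used by any specific tool; the function name, the 0.7 cut-off, and the per-label scores are hypothetical.

```python
# Hypothetical rejection rule: keep the best-scoring label only if its
# score reaches a user-defined cut-off; otherwise report "unassigned".

def assign_with_rejection(label_scores, cutoff=0.7, unassigned="unassigned"):
    """label_scores maps each candidate label to a classifier score for one cell."""
    best_label = max(label_scores, key=label_scores.get)
    if label_scores[best_label] < cutoff:
        return unassigned
    return best_label


confident_cell = {"T cell": 0.92, "B cell": 0.05, "NK cell": 0.03}
ambiguous_cell = {"T cell": 0.40, "B cell": 0.35, "NK cell": 0.25}

print(assign_with_rejection(confident_cell))  # T cell
print(assign_with_rejection(ambiguous_cell))  # unassigned
```

The choice of cut-off trades sensitivity for specificity: a higher value leaves more cells unassigned but lowers the risk of forcing an unseen cell type onto a reference label.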
The ability to correctly classify unseen cell types is key for these methods, and it has been benchmarked for scClassify, CHETAH, scPred, and Garnett. Reports suggest that all tools exploiting supervised classification are reliable, efficient, and accurate. However, these conclusions might be the consequence of classification tests being relatively simple: classifying peripheral blood mononuclear cells or pancreatic cell types is relatively straightforward, given their high level of heterogeneity and the marked differences in the transcriptional profiles of the cells within each dataset. One task likely to be significantly more challenging is the identification of cell subtypes, for example within the neuronal classes, as only a few genes may be crucial for their discrimination [57]. In case studies where the marker genes to be used are clearly defined, approaches like Garnett, SCINA, and CellAssign may outperform brute-force approaches. Similarly, if datasets with meaningful features and sufficient label representation are available, supervised learning methods might offer a powerful and flexible alternative for their analysis.
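To make the hierarchical strategy discussed in this section more concrete, the sketch below walks a cell down a user-supplied cell type tree and stops at an intermediate node when no child label is convincing. It is a simplified illustration under stated assumptions, not the algorithm of any tool named above; the tree, the 0.6 cut-off, and the hard-coded confidences (standing in for per-node classifiers) are all hypothetical.

```python
# Toy cell type tree (hypothetical labels); leaves have no entry of their own.
CELL_TYPE_TREE = {
    "root": ["lymphoid", "myeloid"],
    "lymphoid": ["T cell", "B cell"],
    "myeloid": ["monocyte", "dendritic cell"],
}


def classify_in_tree(score_children, tree=CELL_TYPE_TREE, cutoff=0.6):
    """Descend the tree for one cell, stopping when no child is convincing.

    score_children(node, children) is a user-supplied function returning a
    dict of child label -> confidence in [0, 1].
    """
    node = "root"
    while node in tree:
        scores = score_children(node, tree[node])
        best = max(scores, key=scores.get)
        if scores[best] < cutoff:
            break  # keep the intermediate label reached so far
        node = best
    return "unassigned" if node == "root" else node


# Hard-coded confidences standing in for classifiers trained at each node.
toy_scores = {
    "root": {"lymphoid": 0.9, "myeloid": 0.1},
    "lymphoid": {"T cell": 0.55, "B cell": 0.45},
}
label = classify_in_tree(lambda node, children: toy_scores[node])
print(label)  # lymphoid
```

Here the cell is confidently lymphoid, but the T cell versus B cell decision falls below the cut-off, so the intermediate label is returned instead of a forced leaf assignment.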

Summary and outlook

In the present review, we summarize the three main approaches used for automated cell type annotation on scRNA-seq data. A first category of tools relies on a set of trusted cell type-specific markers to ascribe cell identity in the query. Such markers can come from either database-derived or manually curated lists. In the first case, the set of reference cell types that can be used in the annotation is exhaustive, but the annotation can be uncertain if the query is not clean. On the other hand, manually curated lists are usually limited in terms of cell type coverage, but allow for the use of sophisticated statistical approaches. Correlation-based methods require annotated bulk or single-cell RNA datasets as reference. These methods easily allow multiple references and large consortia data to be merged, making the annotation as comprehensive as possible. Finally, supervised classification methods represent a valid alternative when a meaningful reference dataset is available for the training step, being able to overcome characteristic scRNA-seq noise and batch effects arising from different sequencing technologies. Automated cell type annotation tools have been assessed in a broad range of tissues, sample conditions and applications (Table 3). Notably, a benchmark of supervised classification-based methods for automatic cell annotation was recently conducted by Abdelaal and colleagues [69], showing that each method possesses specific advantages over the others, and that an SVM with a rejection option performs very well. Another benchmark study, comparing different classes of tools, shows that combining multiple tools is highly encouraged for improving accuracy [70].
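One simple way to combine the output of several annotation tools is a per-cell majority vote, sketched below. This is only an assumed combination rule for illustration and is not necessarily the procedure used in the benchmark studies cited above; the tool names, labels, and agreement threshold are hypothetical.

```python
from collections import Counter


def consensus_label(predictions, min_agreement=0.5, unassigned="unassigned"):
    """Majority vote over per-tool labels for one cell.

    predictions maps a tool name to the label it predicted. A label is
    accepted only if strictly more than min_agreement of the tools agree.
    """
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    if votes / len(predictions) > min_agreement:
        return label
    return unassigned


cell_1 = {"tool_A": "T cell", "tool_B": "T cell", "tool_C": "NK cell"}
cell_2 = {"tool_A": "T cell", "tool_B": "B cell", "tool_C": "NK cell"}

print(consensus_label(cell_1))  # T cell
print(consensus_label(cell_2))  # unassigned
```

Cells on which the tools disagree are left unassigned rather than given an arbitrary label, which makes them easy to flag for manual inspection.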

Table 3

Classification challenges benchmarked in the original publication of each tool.

Tool name	Pancreas	PBMC	BMDC	CBMC	Lung	Kidney	Liver	Retina	Brain	Differentiating/transitioning cells	Whole organism	Tumor cells	Cross-species	Cross-platform	Ref
scCATCH	✓	✓				✓			✓						[23]
SCSA		✓			✓							✓			[24]
SCINA		✓							✓			✓			[25]
CellAssign							✓			✓		✓		✓	[26]
scmap-cluster	✓							✓	✓	✓	✓				[27]
scmap-cell	✓							✓	✓	✓	✓				[27]
SingleR			✓		✓					✓					[28]
CHETAH	✓	✓		✓								✓			[29]
scMatch		✓										✓			[30]
ClustifyR		✓		✓					✓		✓		✓		[31]
CIPR		✓										✓			[32]
CaSTLe	✓	✓						✓			✓			✓	[33]
Moana	✓	✓								✓				✓	[34]
LAmbDA	✓								✓						[35]
superCT	✓			✓					✓		✓	✓		✓	[36]
SingleCellNet	✓									✓			✓	✓	[37]
Garnett		✓			✓				✓		✓		✓	✓	[38]
scPred	✓	✓										✓		✓	[39]
ACTINN		✓									✓			✓	[40]
OnClass											✓				[41]
scClassify	✓	✓			✓				✓		✓			✓	[42]
scANVI															[43]
Capybara										✓	✓				[44]
scID								✓	✓						[45]
scNym		✓				✓					✓			✓	[46]

In the future, integrating the crucial role played by post-transcriptional regulatory mechanisms and epigenetic modifications of the genome with the in-depth knowledge currently being generated on the transcriptional profiles of a myriad of cell types across species and contexts will bring a better understanding of cellular identity. While the drop in the cost of sequencing has allowed scRNA-seq technologies to become widely adopted, the implementation of strategies for the simultaneous extraction of transcriptomic, proteomic, and genomic regulatory information at the single-cell level will progressively allow for more refined cellular classifications [71], [72], [73], [74], [75], [76], [77]. The field will definitely benefit from a variety of computational tools for the efficient collection, standardization, and curation of discoveries related to cellular and molecular functions. Even in the absence of a final consensus on what the concept of cellular identity ultimately entails, sufficiently accurate approximations are expected to enable important advances in the field of cell and gene therapy.

CRediT authorship contribution statement

Giovanni Pasquini: Conceptualization, Writing - review & editing. Jesus Eduardo Rojo Arias: Writing - review & editing. Patrick Schäfer: Review & editing. Volker Busskamp: Review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
