Giovanni Pasquini, Jesus Eduardo Rojo Arias, Patrick Schäfer, Volker Busskamp.
Abstract
The advent of single-cell sequencing started a new era of transcriptomic and genomic research, advancing our knowledge of cellular heterogeneity and dynamics. Cell type annotation is a crucial step in analyzing single-cell RNA sequencing (scRNA-seq) data, yet manual annotation is time-consuming and partially subjective. As an alternative, tools have been developed for automatic cell type identification. Different strategies have emerged to associate the gene expression profiles of single cells with a cell type: using curated marker gene databases, correlating reference expression data, or transferring labels by supervised classification. In this review, we present an overview of the available tools and the underlying approaches to perform automated cell type annotation on scRNA-seq data.
Keywords: Automatic annotation; Cell state; Cell type; scRNA-seq
Year: 2021 PMID: 33613863 PMCID: PMC7873570 DOI: 10.1016/j.csbj.2021.01.015
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Tools for automated cell type identification.
| Approach | Marker genes / Reference dataset | Tool name | Language | Computational approach | Unassigned / Multiple reference / Hierarchical classification | Additional features | Ref |
|---|---|---|---|---|---|---|---|
| Marker gene database-based | • | scCATCH | R | Scoring system | – | | |
| | • | SCSA | Python | Scoring system | – | | |
| | • | SCINA | R | Bimodal distribution fitting to marker genes | ✓, – | | |
| | • | CellAssign | R | Probabilistic Bayesian model | ✓, – | | |
| Correlation-based | • | scmap-cluster | R, web app | Cosine, Spearman, Pearson | ✓, ✓ | | |
| | • | scmap-cell | R, web app | Cosine distance-based kNN | ✓, ✓ | | |
| | • | SingleR | R | Spearman | score, ✓ | Harmonization of the labels allows for multiple reference datasets. | |
| | • | CHETAH | R, Shiny app | Spearman + confidence | ✓, ✓ | | |
| | • | scMatch | Python | Spearman, Pearson | score, ✓ | Cell lineage is added as a lower level of classification. | |
| | • | ClustifyR | R | Spearman, Pearson, Kendall, cosine | ✓ | Implements a consensus correlation score. | |
| | • | CIPR | R, Shiny app | Dot product, Spearman, Pearson | score | Dot product implicitly involves feature selection. | |
| Supervised classification-based | • | CaSTLe | R | XGBoost classifier | ✓ | | |
| | • | Moana | Python | | | | |
| | • | LAmbDA | Python | Multiple ML techniques | ✓ | Training on multiple datasets to create a shared representation of the labels to smooth batch effects. | |
| | • | superCT | Web app | Artificial Neural Network | ✓ | Classifier trained on MCA, with the possibility of adding user-defined datasets. | |
| | • | SingleCellNet | R | Random Forest | score | Similarity scores allow transition states and multiple identities to be detected in the same cell. | |
| | • | Garnett | R | Elastic net regression | ✓, ✓, ✓ | Classification can be done using open chromatin information derived from scATAC-seq. | |
| | • | scPred | R | SVM | ✓ | Allows training different classifiers for defined labels. | |
| | • | ACTINN | Python | Artificial Neural Network | | Robust against batch effects induced by sequencing technologies. | |
| | • | OnClass | Python | | ✓, ✓, ✓ | Uses the Cell Ontology to impute labels not present in the training data. | |
| | • | scClassify | R, Shiny app | Weighted | ✓, ✓, ✓ | Hierarchical cell type tree as reference. It combines six similarity matrices with five feature selection methods. | |
| Others | • | scANVI | Python | kNN classifier | | | |
| | • | Capybara | R | Quadratic programming | score | Cell engineering-oriented | |
| | • | scID | R | Fisher’s Linear Discriminant Analysis | ✓ | | |
| | • | scNym | Python | Adversarial Neural Network | ✓, ✓ | | |
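
The table above lists cosine-based kNN as one computational approach and marks which tools can leave a cell "unassigned". The following is a minimal sketch of that idea, not the implementation of any listed tool: the function names, the toy reference, and the thresholds `min_similarity` and `min_agreement` are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an all-zero profile carries no directional information
    return dot / (norm_x * norm_y)


def knn_annotate(query, reference, k=3, min_similarity=0.7, min_agreement=0.6):
    """Label a query cell by majority vote among its k most similar reference cells.

    reference is a list of (label, expression_vector) pairs. The cell is left
    "unassigned" when the nearest neighbour is too dissimilar or when the
    neighbours do not agree strongly enough (both thresholds are hypothetical).
    """
    scored = sorted(
        ((cosine_similarity(query, profile), label) for label, profile in reference),
        reverse=True,
    )
    neighbours = scored[:k]
    votes = Counter(label for _, label in neighbours)
    top_label, top_votes = votes.most_common(1)[0]
    if neighbours[0][0] < min_similarity:
        return "unassigned"
    if top_votes / len(neighbours) < min_agreement:
        return "unassigned"
    return top_label


# Toy reference with three genes per cell (values are made up).
reference = [
    ("alpha", [9, 1, 0]),
    ("alpha", [8, 2, 0]),
    ("alpha", [9, 0, 1]),
    ("beta", [0, 1, 9]),
    ("beta", [1, 0, 8]),
    ("beta", [0, 2, 9]),
]

print(knn_annotate([7, 1, 0], reference))  # close to the "alpha" cells
print(knn_annotate([0, 9, 0], reference))  # resembles neither reference type
```

Rejecting on both similarity and neighbour agreement is one possible design; a tool may instead report a continuous score and leave the cut-off to the user, as the "score" entries in the table suggest.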
Fig. 1. Approaches for cell type annotation of scRNA-seq datasets. scRNA-seq datasets can be automatically annotated by tools implementing one of three approaches: annotation by marker gene databases, correlation-based methods, and annotation by supervised classification. The task of annotating a query scRNA-seq dataset consists of assigning a cell type identity to each query cell, or to a group of cells at once, i.e., a cluster computed in an unbiased manner. (A) Marker gene database-based annotation takes advantage of cell type atlases. Markers derived from the literature and from scRNA-seq analyses have been assembled into reference cell type hierarchies and marker lists. In this approach, basic scoring systems are used to ascribe cell types at the cluster level in the query dataset. (B) Correlation-based methods use multiple correlation measures to compare gene expression profiles between a reference and a query dataset, at either the single-cell or the cluster level, by means of centroids (pseudo-cells obtained by averaging the single-cell gene expression levels of an entire cluster). Some of these tools assemble a reference of cell type gene expression profiles from an ensemble of published studies and bulk RNA data repositories. The annotation step in this approach consists of finding the reference cell type that best correlates with the query cell or cluster, and each tool uses multiple steps to find the best match accurately. (C) Annotation by supervised classification uses machine learning techniques to train a classifier on labeled reference scRNA-seq datasets. The classifier is subsequently applied to the query. Supervised learning is a powerful tool for building a model distribution of training labels as a function of features. Machine learning techniques offer a variety of alternatives in the training step and allow for hierarchical classification, which permits a more biologically relevant identification of cell types.
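
The correlation-based approach in panel (B) can be sketched as follows: build one centroid per reference cell type by averaging its cells, then assign the query to the best-correlating centroid. This is an illustrative sketch only; the function names, the toy expression values, and the `min_corr` cut-off are assumptions, and real tools add further steps (for example feature selection or fine-tuning) that are omitted here.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def centroid(cells):
    """Average each gene across cells (each cell is a list of gene values)."""
    return [mean(gene_values) for gene_values in zip(*cells)]


def annotate(query_profile, reference_centroids, min_corr=0.5):
    """Return (label, correlation) for the best-matching reference centroid.

    The label is "unassigned" when the best correlation is below min_corr,
    a hypothetical threshold used here only for illustration.
    """
    best_label, best_corr = max(
        ((label, spearman(query_profile, profile))
         for label, profile in reference_centroids.items()),
        key=lambda item: item[1],
    )
    if best_corr < min_corr:
        return "unassigned", best_corr
    return best_label, best_corr


# Toy data: four genes, two cells per reference type (values are made up).
reference_centroids = {
    "T cell": centroid([[9, 8, 0, 1], [7, 9, 1, 0]]),
    "B cell": centroid([[0, 1, 8, 9], [1, 0, 9, 7]]),
}
query_cluster = centroid([[6, 7, 0, 1], [8, 6, 1, 0]])
print(annotate(query_cluster, reference_centroids))
```

The same `annotate` call works for a single query cell by passing its expression vector instead of a cluster centroid.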
Publicly available repositories and datasets used by automated annotation tools.
| Resource | Data type | Species | Info | Tissues/cell types | Ref |
|---|---|---|---|---|---|
| Human Primary Cell Atlas | Microarray | Human | Cell type profiles | Cell lines, tissues, primary cells | |
| Blueprint | Bulk RNA-seq | Human | Cell type profiles | Cell lines, tissues, primary cells | |
| FANTOM5 | Bulk RNA-seq | Human, mouse, rat, dog and chicken | Cell type profiles | 15 cell types | |
| ENCODE | Bulk RNA-seq | Human, mouse, fly and worm | Cell type profiles | Cell lines, tissues, primary cells | |
| HCA | Single-cell RNA-seq | Human | Multi-organ datasets | 33 organs | |
| MCA | Single-cell RNA-seq | Mouse | Multi-organ dataset | 98 major cell types | |
| Tabula Muris | Single-cell RNA-seq | Mouse | Multi-organ datasets | 20 organs and tissues | |
| Allen Brain Atlas | Single-nucleus RNA-seq | Human and mouse | Brain datasets | 69 neuronal cell types | |
| CellMarker | Marker genes | Human and mouse | Marker database | 467 (human), 389 (mouse) | |
| PanglaoDB | Marker genes | Human | Marker database | 155 cell types | |
| CancerSEA | Marker genes | Human cancer | Marker database | 14 cancer functional states | |
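
Marker databases such as those in the table above are typically used to score clusters rather than individual cells. A minimal sketch of such a cluster-level scoring system follows. It does not reproduce the scoring of any specific tool or the contents of any listed database: the marker lists, the expression values, and the `min_score` threshold are hypothetical.

```python
from statistics import mean


def score_cluster(cluster_mean_expression, marker_db):
    """Score each cell type as the mean expression of its marker genes.

    cluster_mean_expression maps gene name -> mean expression in the cluster.
    marker_db maps cell type -> non-empty list of marker genes.
    Genes absent from the cluster profile count as 0.
    """
    return {
        cell_type: mean(cluster_mean_expression.get(gene, 0.0) for gene in markers)
        for cell_type, markers in marker_db.items()
    }


def annotate_cluster(cluster_mean_expression, marker_db, min_score=1.0):
    """Return (label, scores); label is "unassigned" below min_score."""
    scores = score_cluster(cluster_mean_expression, marker_db)
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return "unassigned", scores
    return best, scores


# Hypothetical marker lists for illustration, not copied from a database.
marker_db = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["CD79A", "MS4A1"],
}

# Mean expression of one query cluster (made-up values).
cluster = {"CD3D": 4.0, "CD3E": 3.0, "MS4A1": 0.2}
label, scores = annotate_cluster(cluster, marker_db)
print(label, scores)
```

Averaging marker expression is the simplest possible score; weighting markers by specificity or by the evidence supporting them are common refinements that this sketch leaves out.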
Classification challenges benchmarked in the original publication of each tool.
| Tool name | Pancreas | PBMC | BMDC | CBMC | Lung | Kidney | Liver | Retina | Brain | Differentiating/transitioning cells | Whole organism | Tumor cells | Cross-species | Cross-platform | Ref |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| scCATCH | |||||||||||||||
| SCSA | |||||||||||||||
| SCINA | |||||||||||||||
| CellAssign | |||||||||||||||
| scmap-cluster | |||||||||||||||
| scmap-cell | |||||||||||||||
| SingleR | |||||||||||||||
| CHETAH | |||||||||||||||
| scMatch | |||||||||||||||
| ClustifyR | |||||||||||||||
| CIPR | |||||||||||||||
| CaSTLe | |||||||||||||||
| Moana | |||||||||||||||
| LAmbDA | |||||||||||||||
| superCT | |||||||||||||||
| SingleCellNet | |||||||||||||||
| Garnett | |||||||||||||||
| scPred | |||||||||||||||
| ACTINN | |||||||||||||||
| OnClass | |||||||||||||||
| scClassify | |||||||||||||||
| scANVI | |||||||||||||||
| Capybara | |||||||||||||||
| scID | |||||||||||||||
| scNym |