
scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets.

Massimo Andreatta, Ariel J Berenstein, Santiago J Carmona.

Abstract

A common bioinformatics task in single-cell data analysis is to purify a cell type or cell population of interest from heterogeneous datasets. Here we present scGate, an algorithm that automatizes marker-based purification of specific cell populations, without requiring training data or reference gene expression profiles. scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry. scGate outperforms state-of-the-art single-cell classifiers and it can be applied to multiple modalities of single-cell data (e.g. RNA-seq, ATAC-seq, CITE-seq). scGate is implemented as an R package and integrated with the Seurat framework, providing an intuitive tool to isolate cell populations of interest from heterogeneous single-cell datasets.
AVAILABILITY: R package source code and reproducible tutorials are available at https://github.com/carmonalab/scGate.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.


Year:  2022        PMID: 35258562      PMCID: PMC9048671          DOI: 10.1093/bioinformatics/btac141

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Single-cell RNA sequencing (scRNA-seq) is becoming increasingly popular, enabling high-throughput exploration of cell types and states from complex tissues. Cell types are generally defined based on biological function and the markers used to physically isolate them, but these can change depending on the source tissue. In scRNA-seq data analysis, knowledge of cell type-defining marker genes is typically used to manually identify relevant cell populations within custom bioinformatics workflows, requiring several steps and parameters. Alternatively, when high-quality transcriptomic profiles are available for the cell type of interest, training multinomial machine learning classifiers to predict cell type identity has been shown to be a powerful approach (Abdelaal et al.; Pasquini et al.). For example, popular tools such as SingleR perform well when trained on high-quality bulk RNA-seq gene expression profiles of sorted cell populations (Huang et al.). However, reliable reference transcriptional profiles are not always available. Moreover, batch effects and other biases are difficult to assess in training datasets, which can lead to overfitting and biased predictions. In this work, we developed an intuitive tool to purify a cell population of interest from complex scRNA-seq datasets based on literature-derived marker genes, without requiring reference gene expression profiles or training data. With scGate, an expert can purify a cell population of interest from a complex scRNA-seq dataset by defining only a few marker genes, or by using sets of markers distributed with the scGate package. This provides a straightforward approach, complementary to machine learning-based classifiers, aimed at automating current practices in marker-based purification of cell types from single-cell transcriptomics datasets.

2 Results

scGate is an R package that automatizes the typically manual task of marker-based cell type annotation, to enable accurate and intuitive purification of a cell population of interest from a complex scRNA-seq dataset (for instance, a dataset of blood-derived immune cells, Fig. 1A). scGate builds upon our recent method UCell (Andreatta and Carmona, 2021) for robust single-cell signature scoring, and Seurat, a comprehensive and powerful framework for single-cell omics analysis (Hao et al.).
Fig. 1.

Purifying cell populations from single-cell datasets using scGate. (A) Uniform Manifold Approximation and Projection (UMAP) representation of scRNA-seq data of PBMC populations annotated by Hao et al. (B) Purification of target cell types using scGate, for B cells on the left (using marker MS4A1 [encoding CD20]) and NK cells on the right (using NCAM1 [encoding CD56] and KLRD1 as positive markers, and CD3D as a negative marker). The violin plots display normalized ADT counts for the indicated proteins on the same cells. Precision (PREC), recall (REC) and MCC are shown. (C) UMAP representation of scRNA-seq data of melanoma tumors annotated by Jerby-Arnon et al. (D) Purification of macrophages using a hierarchical GM: immune cells at the first level (left panel) and macrophages at the second level (middle panel). Macrophage gene signature (UCell) scores are shown in the right panel. (E) scGate purification of monocytes using DNA accessibility of a PBMC 10× multiomics dataset. Violin plots display coupled RNA expression values. Gene-associated accessibility values were inferred using Signac (Stuart et al.). (F) PREC (Positive Predictive Value) and MCC values for five publicly available scRNA-seq datasets (derived from blood or tumors) for scGate and three other cell type classifiers.

Briefly, scGate takes as input: (i) a gene expression matrix or Seurat object and (ii) a ‘gating model’ (GM), consisting of a set of marker genes that define the cell population of interest. The GM can be as simple as a single marker gene, or a combination of positive and negative markers. For example, the marker MS4A1 (encoding CD20) alone allows purifying B cells with 99.7% precision and 99.7% recall (Fig. 1B, left panel). A model that requires NCAM1 and KLRD1 but absence of CD3D purifies natural killer (NK) cells with 100% precision and 95% recall (Fig. 1B, right panel). In both cases, antibody-derived tags (ADT) from the same cells confirm a correct purification at the protein level (Fig. 1B).
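To make the idea of a GM with positive and negative markers concrete, the following is a minimal Python sketch, not scGate's implementation (scGate is an R package and scores signatures with UCell rather than checking markers one by one). The function names `split_signature` and `toy_gate`, the `min_expr` cutoff and the expression values are hypothetical. The only convention taken from the text is that a trailing "-" marks a negative marker, as in "CD3D-".

```python
def split_signature(signature):
    """Split marker names into positive and negative lists.

    A trailing '-' marks a negative marker (e.g. "CD3D-");
    an optional trailing '+' on a positive marker is stripped.
    """
    positive, negative = [], []
    for marker in signature:
        if marker.endswith("-"):
            negative.append(marker[:-1])
        else:
            positive.append(marker.rstrip("+"))
    return positive, negative


def toy_gate(cell_expr, signature, min_expr=0.0):
    """Toy rule (an assumption, not scGate's scoring): a cell passes if every
    positive marker is expressed above min_expr and no negative marker is."""
    positive, negative = split_signature(signature)
    has_positive = all(cell_expr.get(g, 0.0) > min_expr for g in positive)
    has_negative = any(cell_expr.get(g, 0.0) > min_expr for g in negative)
    return has_positive and not has_negative


# Hypothetical normalized expression values for two cells.
nk_signature = ["NCAM1", "KLRD1", "CD3D-"]
nk_like_cell = {"NCAM1": 2.1, "KLRD1": 1.4, "CD3D": 0.0}
t_like_cell = {"NCAM1": 0.8, "KLRD1": 1.0, "CD3D": 3.2}

print(toy_gate(nk_like_cell, nk_signature))
print(toy_gate(t_like_cell, nk_signature))
```

A per-marker rule like this is fragile on sparse single-cell data, which is one reason a signature-level score combined with neighbor smoothing is preferable in practice.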
More complex GMs can be constructed in a hierarchical fashion. For instance, macrophages can be purified from a complex tissue such as melanoma tumors (Fig. 1C) by defining a two-level hierarchical GM. The first level gates on immune cells using the pan-immune cell marker PTPRC (encoding CD45); the second level then purifies macrophages from the immune cells using the markers CD68 and FCGR1A (Fig. 1D).
Our algorithm evaluates the strength of the marker activity in each cell using the rank-based method UCell, and then performs k-nearest neighbor (kNN) smoothing by calculating the mean UCell score across neighboring cells. With kNN smoothing, scGate aims to mitigate the large degree of sparsity in single-cell omics data. Finally, a fixed threshold over kNN-smoothed signature scores is applied in binary decision trees generated from the user-provided GM (e.g. Fig. 1D and Supplementary Fig. S1A), to annotate cells as either ‘pure’ or ‘impure’ with respect to the cell population of interest.
The intuitive and flexible design of scGate allows for positive and negative markers and sequential/hierarchical gating strategies, providing users with a quick and simple, yet powerful tool to purify cell populations of interest from arbitrarily complex datasets. For example, a simple two-gene signature (Foxi1+ Cftr+) was sufficient to accurately isolate rare pulmonary ionocytes (Montoro et al.) (Supplementary Fig. S1B). Each of the purifications shown in Figure 1B required just one line of R code within a Seurat workflow, for instance, to purify NK cells:
    scGate(seurat_object, model=gating_model(name="NK", signature=c("NCAM1","KLRD1","CD3D-")))
scGate can also be applied to single-cell modalities other than RNA-seq. On a multi-omics scATAC + scRNA-seq dataset of human peripheral blood mononuclear cells (PBMCs), scGate was able to successfully isolate CD14+ monocytes, T cells, NK cells and B cells from DNA accessibility data (Fig. 1E and Supplementary Fig. S1C).
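The smoothing and thresholding steps of hierarchical gating can be illustrated with a small Python sketch. This is a sketch under stated assumptions and not the package's code. It assumes per-cell signature scores have already been computed for each level (the text uses UCell for this). It uses Euclidean distance on toy low-dimensional coordinates, includes the cell itself in the neighborhood mean, and applies an illustrative threshold of 0.2 at every level. The names `knn_smooth` and `hierarchical_gate`, and all numeric values, are hypothetical.

```python
import math


def knn_smooth(scores, coords, k):
    """Replace each cell's score by the mean over itself and its k nearest cells."""
    smoothed = []
    for i, ci in enumerate(coords):
        # Rank all other cells by Euclidean distance to cell i.
        order = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(ci, coords[j]),
        )
        group = [i] + order[:k]
        smoothed.append(sum(scores[j] for j in group) / len(group))
    return smoothed


def hierarchical_gate(level_scores, coords, k=2, threshold=0.2):
    """Apply one gate per level; a cell is 'Pure' only if it passes every level."""
    passing = [True] * len(coords)
    for scores in level_scores:
        smoothed = knn_smooth(scores, coords, k)
        passing = [p and s > threshold for p, s in zip(passing, smoothed)]
    return ["Pure" if p else "Impure" for p in passing]


# Six toy cells forming two groups along one axis.
coords = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
# Level 1 (e.g. immune signature): cell 1 has a dropout (score 0.0).
immune = [0.5, 0.0, 0.6, 0.5, 0.4, 0.6]
# Level 2 (e.g. macrophage signature): high only in the first group.
macrophage = [0.4, 0.5, 0.0, 0.0, 0.0, 0.1]

labels = hierarchical_gate([immune, macrophage], coords)
print(labels)
```

In this toy example, cell 1 has a raw level-1 score of zero but is rescued by its neighbors. This is the behavior that smoothing is meant to provide on sparse data.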
In a converse experiment to Figure 1B, scGate was applied to ADT counts, and the paired scRNA-seq readouts confirmed the accuracy of target cell type isolation (Supplementary Fig. S1D).
scGate comes with pre-defined GMs based on commonly used markers of immune cell types in human and mouse, such as T cells, B cells, NK cells and myeloid cell populations, among others. With these marker sets and five author-annotated published datasets from blood or tumors (Abdelaal et al.; Hao et al.; Jerby-Arnon et al.; Yost et al.; Zilionis et al.), we compared the predictive performance of scGate against three popular single-cell classifiers: SingleR (Aran et al.), SCINA (Zhang et al.) and Garnett (Pliner et al.). Of note, SingleR and Garnett are supervised classifiers and require reference gene expression profiles for training. For SingleR, we used the recommended HPCA dataset for training and default values for all other parameters; for Garnett, we applied the pre-trained PBMC classifier provided by the authors. SCINA is marker-based, but its authors provide no database of signatures; we therefore adapted scGate signatures to be compatible with SCINA.
Across the board, scGate outperformed competing methods for the isolation of target cell types (Fig. 1F). Compared with the second-best method for each metric, scGate had higher mean precision than SCINA (0.88 versus 0.83, paired Wilcoxon test P-value = 5.9 × 10^-4) and a higher Matthews correlation coefficient (MCC) than SingleR (0.81 versus 0.75, paired Wilcoxon test P-value = 6.7 × 10^-3).
Multiple predefined scGate models are provided in a public repository as version-controlled tab-separated text, allowing scGate to automatically synchronize its internal database of GMs. Users can manually edit these models and easily write their own. scGate also provides functions to evaluate the performance of custom GMs, either user-provided or those that accompany the package, on a set of pre-annotated testing datasets.
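The metrics used in this comparison (precision, recall and MCC) follow standard definitions and can be computed from binary ‘pure’/‘impure’ calls against author annotations. The Python sketch below shows those definitions only; it is not the package's evaluation function, and the name `gate_metrics` and the toy labels are hypothetical.

```python
import math


def gate_metrics(truth, predicted):
    """Return (precision, recall, MCC) for boolean truth/prediction lists."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    # set to 0 by convention when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Toy example: 4 true target cells, 6 other cells.
truth = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
precision, recall, mcc = gate_metrics(truth, predicted)
print(round(precision, 3), round(recall, 3), round(mcc, 3))
```

Unlike precision alone, MCC uses all four cells of the confusion matrix, so it penalizes a gate that stays pure only by discarding most target cells.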
Overall, scGate provides an accurate, scalable and intuitive tool to isolate cell populations of interest that can be seamlessly incorporated into Seurat pipelines for single-cell data analysis.

Funding

This work was supported by the Swiss National Science Foundation (SNF) Ambizione [180010 to S.J.C.]. A.J.B. was supported by the National Scientific and Technical Research Council of Argentina (CONICET).
Conflict of Interest: none declared.
References

1.  A Cancer Cell Program Promotes T Cell Exclusion and Resistance to Checkpoint Blockade.

Authors:  Livnat Jerby-Arnon; Parin Shah; Michael S Cuoco; Christopher Rodman; Mei-Ju Su; Johannes C Melms; Rachel Leeson; Abhay Kanodia; Shaolin Mei; Jia-Ren Lin; Shu Wang; Bokang Rabasha; David Liu; Gao Zhang; Claire Margolais; Orr Ashenberg; Patrick A Ott; Elizabeth I Buchbinder; Rizwan Haq; F Stephen Hodi; Genevieve M Boland; Ryan J Sullivan; Dennie T Frederick; Benchun Miao; Tabea Moll; Keith T Flaherty; Meenhard Herlyn; Russell W Jenkins; Rohit Thummalapalli; Monika S Kowalczyk; Israel Cañadas; Bastian Schilling; Adam N R Cartwright; Adrienne M Luoma; Shruti Malu; Patrick Hwu; Chantale Bernatchez; Marie-Andrée Forget; David A Barbie; Alex K Shalek; Itay Tirosh; Peter K Sorger; Kai Wucherpfennig; Eliezer M Van Allen; Dirk Schadendorf; Bruce E Johnson; Asaf Rotem; Orit Rozenblatt-Rosen; Levi A Garraway; Charles H Yoon; Benjamin Izar; Aviv Regev
Journal:  Cell       Date:  2018-11-01       Impact factor: 41.582

2.  Single-cell chromatin state analysis with Signac.

Authors:  Tim Stuart; Avi Srivastava; Shaista Madad; Caleb A Lareau; Rahul Satija
Journal:  Nat Methods       Date:  2021-11-01       Impact factor: 47.990

3.  Clonal replacement of tumor-specific T cells following PD-1 blockade.

Authors:  Kathryn E Yost; Ansuman T Satpathy; Daniel K Wells; Yanyan Qi; Chunlin Wang; Robin Kageyama; Katherine L McNamara; Jeffrey M Granja; Kavita Y Sarin; Ryanne A Brown; Rohit K Gupta; Christina Curtis; Samantha L Bucktrout; Mark M Davis; Anne Lynn S Chang; Howard Y Chang
Journal:  Nat Med       Date:  2019-07-29       Impact factor: 53.440

4.  UCell: Robust and scalable single-cell gene signature scoring.

Authors:  Massimo Andreatta; Santiago J Carmona
Journal:  Comput Struct Biotechnol J       Date:  2021-06-30       Impact factor: 7.271

5.  Integrated analysis of multimodal single-cell data.

Authors:  Yuhan Hao; Stephanie Hao; Erica Andersen-Nissen; William M Mauck; Shiwei Zheng; Andrew Butler; Maddie J Lee; Aaron J Wilk; Charlotte Darby; Michael Zager; Paul Hoffman; Marlon Stoeckius; Efthymia Papalexi; Eleni P Mimitou; Jaison Jain; Avi Srivastava; Tim Stuart; Lamar M Fleming; Bertrand Yeung; Angela J Rogers; Juliana M McElrath; Catherine A Blish; Raphael Gottardo; Peter Smibert; Rahul Satija
Journal:  Cell       Date:  2021-05-31       Impact factor: 41.582

6.  Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.

Authors:  Dvir Aran; Agnieszka P Looney; Leqian Liu; Esther Wu; Valerie Fong; Austin Hsu; Suzanna Chak; Ram P Naikawadi; Paul J Wolters; Adam R Abate; Atul J Butte; Mallar Bhattacharya
Journal:  Nat Immunol       Date:  2019-01-14       Impact factor: 25.606

7.  A comparison of automatic cell identification methods for single-cell RNA sequencing data.

Authors:  Tamim Abdelaal; Lieke Michielsen; Davy Cats; Dylan Hoogduin; Hailiang Mei; Marcel J T Reinders; Ahmed Mahfouz
Journal:  Genome Biol       Date:  2019-09-09       Impact factor: 13.583

8.  Supervised classification enables rapid annotation of cell atlases.

Authors:  Hannah A Pliner; Jay Shendure; Cole Trapnell
Journal:  Nat Methods       Date:  2019-09-09       Impact factor: 28.547

9.  Automated methods for cell type annotation on scRNA-seq data.

Authors:  Giovanni Pasquini; Jesus Eduardo Rojo Arias; Patrick Schäfer; Volker Busskamp
Journal:  Comput Struct Biotechnol J       Date:  2021-01-19       Impact factor: 7.271

10.  Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data.

Authors:  Qianhui Huang; Yu Liu; Yuheng Du; Lana X Garmire
Journal:  Genomics Proteomics Bioinformatics       Date:  2020-12-24       Impact factor: 7.691
