STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data.

Massimo Andreatta^1,2,3, Santiago J Carmona^1,2,3.

Abstract

SUMMARY: STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations.
AVAILABILITY AND IMPLEMENTATION: Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.

Year: 2021 PMID: 32845323 PMCID: PMC8098019 DOI: 10.1093/bioinformatics/btaa755

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Massively parallel single-cell transcriptomics (scRNA-seq) has emerged as a transformative technology that enables measuring molecular profiles at single-cell resolution. However, despite highly multiplexed technologies, single-cell data are produced separately for different tissues and organs and are affected by multiple batch effects, such as different sample processing and scRNA-seq protocols. As such, integration of single-cell data might be the ultimate challenge in the field towards the generation of single-cell atlases (Eisenstein, 2020; Lähnemann et al., 2020; Regev et al.). Seurat (Stuart et al., 2019) is currently one of the most popular and best performing algorithms for single-cell data integration, and can be effortlessly integrated into complex analysis pipelines (Tran et al., 2020). At the core of the Seurat integration algorithm is the identification of mutual nearest neighbors (MNN) across single-cell datasets, named ‘anchors’, in a reduced space obtained from canonical correlation analysis (CCA). These anchors and their scores are used to compute correction vectors for each query cell, transforming (i.e. batch-correcting) its expression profile (Haghverdi et al., 2018). Transformed cell profiles can then be jointly analyzed as part of an integrated space. To handle more than two datasets, a guide tree based on pairwise batch similarities is used to dictate the batch integration order. While Seurat has proven very powerful for the removal of technical artifacts between replicated experiments or even different sequencing technologies (Tran et al., 2020), it tends to overcorrect batch effects and performs poorly when integrating heterogeneous datasets (Luecken et al.), where only a fraction of cell types are shared between individual samples. This is crucial for the creation of reference cell type-specific single-cell atlases where the datasets to integrate were obtained from different tissues or experimental conditions (e.g. T cells from blood versus tumor-infiltrating T cells), and as a consequence are composed of different, partially overlapping cell states or sub-types.
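The mutual nearest neighbor idea behind the anchors mentioned above can be illustrated with a minimal sketch. It is not the Seurat implementation: the function names, the toy 2-D coordinates and the use of plain Euclidean distance are assumptions made for illustration, and real pipelines work in a CCA or PCA space with many more dimensions.

```python
import math

def nearest_neighbors(queries, targets, k):
    """For each query point, return the indices of its k closest target points."""
    result = []
    for q in queries:
        order = sorted(range(len(targets)), key=lambda j: math.dist(q, targets[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_neighbors(a, b, k=1):
    """Return (i, j) pairs where b[j] is a k-nearest neighbor of a[i] and vice versa."""
    a_to_b = nearest_neighbors(a, b, k)
    b_to_a = nearest_neighbors(b, a, k)
    return [(i, j)
            for i in range(len(a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]

# Hypothetical cell embeddings for two batches in a shared reduced space.
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.5, 0.2), (5.2, 4.9)]

anchors = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(anchors)
```

In this toy example the third cell of `batch_a` has a nearest neighbor in `batch_b`, but that neighbor prefers a different cell, so the pair is not mutual and no anchor is formed.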

2 Results

STACAS is a package for determining integration anchors between heterogeneous datasets, and it is designed to be easily incorporated into Seurat dataset integration pipelines. STACAS uses a reciprocal principal component analysis (PCA) procedure to calculate anchors, where each dataset in a pair is projected onto the reduced PCA space of the other dataset; mutual nearest neighbors are then calculated in these reduced spaces. Crucially, and in contrast to the CCA reduction used by Seurat, the expression values of genes used in generating the PCA spaces are not rescaled to have zero mean and unit variance. When integrating heterogeneous datasets, for instance composed only of CD4+ or CD8+ T cells, such rescaling can cancel out important biological differences between the datasets (Fig. 1A).
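The effect of rescaling described above can be seen in a small numeric sketch. The expression values are hypothetical, and the sketch assumes that z-scoring is applied to each dataset separately; it only illustrates why standardizing a gene within a CD8+ only sample and within a CD4+ only sample removes the difference in mean expression between them.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale values to zero mean and unit variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical log-normalized Cd8a expression in two single-subtype samples.
cd8_sample = [2.0, 2.2, 1.8, 2.4]  # CD8+ T cells only: Cd8a is high
cd4_sample = [0.0, 0.1, 0.0, 0.3]  # CD4+ T cells only: Cd8a is near zero

# Difference in mean expression before and after per-dataset rescaling.
raw_gap = mean(cd8_sample) - mean(cd4_sample)
scaled_gap = mean(zscore(cd8_sample)) - mean(zscore(cd4_sample))

print(f"gap without rescaling: {raw_gap:.2f}")
print(f"gap after rescaling:   {scaled_gap:.2f}")
```

After rescaling, both samples are centered on zero, so a neighbor search on the scaled values can no longer tell that one sample expresses Cd8a and the other does not.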

Fig. 1.

Anchor finding and dataset integration using STACAS. (A) Expression level (log[normalized UMI counts + 1]) of Cd8a and Cd4 after integration with Seurat CCA (top) or STACAS (bottom); important biological differences between the samples are lost by data rescaling and sub-optimal anchoring by Seurat 3 CCA. (B) Anchor distance distribution between pairs of samples prior to anchor filtering by STACAS; poor anchors with distance higher than threshold (represented with a vertical dashed line) are filtered out by STACAS. (C–E) Low-dimensionality UMAP visualization of scRNA-seq data, colored by sample, without batch correction (C), using Seurat CCA anchors (D) and using STACAS anchors (E) for dataset alignment. (F–H) UMAP visualization of scRNA-seq data, colored by TILPRED state prediction, without batch correction (F), using Seurat CCA anchors (G) and using STACAS anchors (H) for dataset alignment.

A second innovation introduced in STACAS is the filtering of anchors based on anchor pairwise distance, which is calculated on the reduced PCA spaces used to determine the anchors. We observed that the distribution of anchor distances between datasets with shared cell subtypes (e.g. a sample containing both CD4+ and CD8+ T cells, compared to a sample of CD8+ T cells only) is centered on lower pairwise anchor distances compared with dataset pairs with limited or no overlap (e.g. a CD4+ sample and a CD8+ sample) (Fig. 1B); anchor distance can therefore be used as a quantitative measure to filter spurious anchors and improve dataset integration. In STACAS, the anchor filtering threshold defaults to the 80th percentile of the distance distribution between the two most similar datasets included in the integration task.

Finally, the anchors determined by STACAS can be used directly for dataset integration using the IntegrateData function in Seurat 3. STACAS suggests a guide tree to determine the order in which datasets are to be integrated. In contrast to the Seurat default guide tree, which favors datasets with the highest total number of cells in any given pair, STACAS prioritizes samples with the highest total number of anchors; the rationale being that datasets with many anchors are likely to contain more cell types and represent the ‘centroid’ of the integrated map.

In the example in Figure 1, we integrated four scRNA-seq datasets of mouse T cells from public repositories, composed of (i) CD8+ tumor-infiltrating lymphocytes (TILs) (Carmona et al., 2020); (ii) CD4+ and CD8+ TILs (Xiong et al., 2019); (iii) CD4+ T cells from tumors (Magen et al., 2019) and (iv) CD4+ T cells from tumor-draining lymph nodes (dLN) (Magen et al., 2019). There is an evident batch effect between the samples, with the cells of each sample clustering together regardless of their type (Fig. 1C and F). Consistent with a recent benchmark (Luecken et al.), dataset alignment using Seurat 3 appears to overcorrect these batch effects, overlaying samples with little in common such as CD4+ dLN and CD8+ TILs (Fig. 1D and G). In contrast, STACAS only aligns cells with similar states across samples, limiting the superposition of CD4+ with CD8+ cells (Fig. 1E). Supervised cell state classification using TILPRED (Carmona et al., 2020) confirms that in most cases STACAS was able to cluster cell types across different, heterogeneous datasets (Fig. 1H). We obtained similar, consistent results on larger-scale integration tasks toward the construction of reference T cell maps in cancer and chronic infection (Andreatta et al.). An interactive TIL reference atlas constructed using STACAS can be explored at:
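The anchor filtering and anchor-based ordering described in this section can be sketched as follows. This is an illustrative simplification, not the STACAS code: the sample names and distances are invented, the "most similar" pair is assumed to be the one with the lowest mean anchor distance, and the guide tree is reduced to a simple ranking of samples by their total number of retained anchors.

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def mean_distance(dists):
    return sum(dists) / len(dists)

def filter_anchors(distances_by_pair, q=80):
    """Drop anchors farther apart than the q-th percentile of the reference pair."""
    # Assumption: the most similar pair is the one with the lowest mean distance.
    reference = min(distances_by_pair, key=lambda p: mean_distance(distances_by_pair[p]))
    threshold = percentile(distances_by_pair[reference], q)
    kept = {pair: [d for d in dists if d <= threshold]
            for pair, dists in distances_by_pair.items()}
    return threshold, kept

def rank_by_anchor_count(kept):
    """Rank samples by total retained anchors (a stand-in for a full guide tree)."""
    totals = {}
    for (a, b), dists in kept.items():
        totals[a] = totals.get(a, 0) + len(dists)
        totals[b] = totals.get(b, 0) + len(dists)
    return sorted(totals, key=lambda s: (-totals[s], s))

# Hypothetical anchor distances for three pairs of samples.
distances = {
    ("CD8_TIL", "CD4_CD8_TIL"): [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    ("CD4_CD8_TIL", "CD4_tumor"): [0.2, 0.3, 0.5, 0.7],
    ("CD8_TIL", "CD4_tumor"): [0.8, 0.9, 1.0],
}

threshold, kept = filter_anchors(distances, q=80)
order = rank_by_anchor_count(kept)
print(threshold, {pair: len(d) for pair, d in kept.items()}, order)
```

With these toy values, every anchor between the CD8+ only and CD4+ only samples exceeds the threshold and is discarded, while the mixed CD4+/CD8+ sample retains the most anchors and is ranked first, matching the 'centroid' rationale given above.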

Funding

This research was supported by the Swiss National Science Foundation (SNF) Ambizione [180010 to S.J.C.].

Conflict of Interest: none declared.

Data availability

The data analyzed in this article are publicly available from NCBI Gene Expression Omnibus (GEO) at https://www.ncbi.nlm.nih.gov/geo/ under the identifiers GSE124691 and GSE116390, and from EMBL-EBI ArrayExpress at https://www.ebi.ac.uk/arrayexpress/ under entry E-MTAB-7919.

References

1. Comprehensive Integration of Single-Cell Data.

Authors: Tim Stuart; Andrew Butler; Paul Hoffman; Christoph Hafemeister; Efthymia Papalexi; William M Mauck; Yuhan Hao; Marlon Stoeckius; Peter Smibert; Rahul Satija
Journal: Cell Date: 2019-06-06 Impact factor: 41.582

2. Single-cell RNA-seq analysis software providers scramble to offer solutions.

Authors: Michael Eisenstein
Journal: Nat Biotechnol Date: 2020-03 Impact factor: 54.908

3. Coexpression of Inhibitory Receptors Enriches for Activated and Functional CD8⁺ T Cells in Murine Syngeneic Tumor Models.

Authors: Huizhong Xiong; Stephanie Mittman; Ryan Rodriguez; Patricia Pacheco-Sanchez; Marina Moskalenko; Yagai Yang; Justin Elstrott; Alex T Ritter; Sören Müller; Dorothee Nickles; Teresita L Arenzana; Aude-Hélène Capietto; Lélia Delamarre; Zora Modrusan; Sascha Rutz; Ira Mellman; Rafael Cubas
Journal: Cancer Immunol Res Date: 2019-05-07 Impact factor: 11.151

4. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.

Authors: Laleh Haghverdi; Aaron T L Lun; Michael D Morgan; John C Marioni
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

5. Eleven grand challenges in single-cell data science.

Authors: David Lähnemann; Johannes Köster; Ewa Szczurek; Davis J McCarthy; Stephanie C Hicks; Mark D Robinson; Catalina A Vallejos; Kieran R Campbell; Niko Beerenwinkel; Ahmed Mahfouz; Luca Pinello; Pavel Skums; Alexandros Stamatakis; Camille Stephan-Otto Attolini; Samuel Aparicio; Jasmijn Baaijens; Marleen Balvert; Buys de Barbanson; Antonio Cappuccio; Giacomo Corleone; Bas E Dutilh; Maria Florescu; Victor Guryev; Rens Holmer; Katharina Jahn; Thamar Jessurun Lobo; Emma M Keizer; Indu Khatri; Szymon M Kielbasa; Jan O Korbel; Alexey M Kozlov; Tzu-Hao Kuo; Boudewijn P F Lelieveldt; Ion I Mandoiu; John C Marioni; Tobias Marschall; Felix Mölder; Amir Niknejad; Lukasz Raczkowski; Marcel Reinders; Jeroen de Ridder; Antoine-Emmanuel Saliba; Antonios Somarakis; Oliver Stegle; Fabian J Theis; Huan Yang; Alex Zelikovsky; Alice C McHardy; Benjamin J Raphael; Sohrab P Shah; Alexander Schönhuth
Journal: Genome Biol Date: 2020-02-07 Impact factor: 13.583

6. Deciphering the transcriptomic landscape of tumor-infiltrating CD8 lymphocytes in B16 melanoma tumors with single-cell RNA-Seq.

Authors: Santiago J Carmona; Imran Siddiqui; Mariia Bilous; Werner Held; David Gfeller
Journal: Oncoimmunology Date: 2020-03-12 Impact factor: 8.110

7. Single-Cell Profiling Defines Transcriptomic Signatures Specific to Tumor-Reactive versus Virus-Responsive CD4⁺ T Cells.

Authors: Assaf Magen; Jia Nie; Thomas Ciucci; Samira Tamoutounour; Yongmei Zhao; Monika Mehta; Bao Tran; Dorian B McGavern; Sridhar Hannenhalli; Rémy Bosselut
Journal: Cell Rep Date: 2019-12-03 Impact factor: 9.423

8. A benchmark of batch-effect correction methods for single-cell RNA sequencing data.

Authors: Hoa Thi Nhu Tran; Kok Siong Ang; Marion Chevrier; Xiaomeng Zhang; Nicole Yee Shin Lee; Michelle Goh; Jinmiao Chen
Journal: Genome Biol Date: 2020-01-16 Impact factor: 13.583

