
UCell: Robust and scalable single-cell gene signature scoring.

Massimo Andreatta, Santiago J Carmona.

Abstract

UCell is an R package for evaluating gene signatures in single-cell datasets. UCell signature scores, based on the Mann-Whitney U statistic, are robust to dataset size and heterogeneity, and their calculation demands less computing time and memory than other available methods, enabling the processing of large datasets in a few minutes even on machines with limited computing power. UCell can be applied to any single-cell data matrix, and includes functions to directly interact with Seurat objects. The UCell package and documentation are available on GitHub at https://github.com/carmonalab/UCell.
© 2021 The Author(s).

Keywords:  Cell type; Gene set enrichment; Gene signature; Module scoring; Single-cell

Year:  2021        PMID: 34285779      PMCID: PMC8271111          DOI: 10.1016/j.csbj.2021.06.043

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   7.271


Introduction

In single-cell RNA-seq analysis, gene signature (or “module”) scoring is a simple yet powerful approach to evaluate the strength of biological signals, typically associated with a specific cell type or biological process, in a transcriptome. Thousands of gene sets have been derived by measuring transcriptional differences between biological states or cell phenotypes, and are collected in public databases such as MSigDB [1]. More recently, large-scale efforts to construct single-cell atlases [2], [3] are providing specific gene sets that can be used to discriminate between cell types. For example, Han et al. used single-cell RNA sequencing to quantify cell type heterogeneity in different tissues and to define gene signatures for >100 human and murine cell types [3]. Given such a gene set, signature scoring aims to quantify the activity of the genes in the set, with the goal of characterizing cell types, states, active biological processes or responses to environmental cues.

The Seurat R package [4] is one of the most comprehensive and widely used frameworks for scRNA-seq data analysis. Seurat provides a computationally efficient gene signature scoring function, named AddModuleScore, originally proposed by Tirosh et al. [5]. However, because genes are binned based on their average expression across the whole dataset for normalization purposes, the method generates inconsistent results for the same cell depending on the composition of the dataset. Inspired by the AUCell algorithm implemented in SCENIC [6], we propose UCell, a gene signature scoring method based on the Mann-Whitney U statistic. UCell scores depend only on the relative gene expression in individual cells and are therefore not affected by dataset composition. We provide a time- and memory-efficient implementation of the algorithm that can be seamlessly incorporated into Seurat workflows.

Methods

UCell calculates gene signature scores for scRNA-seq data based on the Mann-Whitney U statistic [7]. Given a $g \times c$ matrix $X$ of numerical values (e.g. gene expression measurements) for $g$ genes in $c$ cells, we first calculate the matrix of relative ranks $R$ by sorting each column in $X$; in other words, we calculate a ranked list of genes for each cell in the dataset. Because in scRNA-seq not all molecules in the original sample are observed, transcript count matrices contain many zeros, resulting in a long tail of bottom-ranking genes. To mitigate this uninformative tail, we set $r_{i,j} = r_{max} + 1$ for all $r_{i,j} > r_{max}$, with $r_{max} = 1500$ by default (matching typical thresholds used for quality control for minimum number of genes detected). To evaluate a gene signature $s$ composed of $n$ genes $(s_1, \ldots, s_n)$, we calculate a UCell score $U'_j$ for each cell $j$ in $X$ with the formula:

$$U'_j = 1 - \frac{U_j}{n \cdot r_{max}}$$

where $U_j$ is the Mann-Whitney U statistic calculated by:

$$U_j = \sum_{i=1}^{n} r'_{i,j} - \frac{n(n+1)}{2}$$

and $R'$ is obtained by sub-setting $R$ on the genes in signature $s$. We note that the U statistic is closely related to the area-under-the-curve (AUC) statistic for ROC curves [8]; we therefore expect UCell scores to correlate with methods based on AUC scores such as AUCell [6].

Internally, UCell uses the frank function from the data.table package [9] for efficient rank computations. Large datasets are automatically split into batches of reduced size, which can be processed serially (minimizing memory usage) or in parallel through the future package [10] (minimizing execution time), depending on the available computational resources.
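The scoring formula can be illustrated with a minimal, standard-library Python sketch for a single cell. This is not the package's R implementation: the function names `ranks_descending` and `ucell_score` and the toy counts are hypothetical, and two details are assumptions made for the illustration, namely that rank 1 goes to the most highly expressed gene (the direction under which the score reaches 1 when the signature genes top the list) and that tied values share their average rank.

```python
def ranks_descending(values):
    """Rank values so the largest gets rank 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end + 2) / 2.0  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def ucell_score(expression, genes, signature, max_rank=1500):
    """UCell-style score U' = 1 - U / (n * max_rank) for a single cell."""
    rank_of = dict(zip(genes, ranks_descending(expression)))
    n = len(signature)
    # Collapse the uninformative tail: any rank beyond max_rank becomes max_rank + 1.
    capped = [rank_of[g] if rank_of[g] <= max_rank else max_rank + 1 for g in signature]
    u_stat = sum(capped) - n * (n + 1) / 2.0
    return 1.0 - u_stat / (n * max_rank)


genes = ["CD8A", "CD8B", "CD4", "FOXP3", "KLRB1"]
cell = [5, 4, 0, 1, 3]  # hypothetical counts for one cell
print(ucell_score(cell, genes, ["CD8A", "CD8B"], max_rank=4))  # 1.0
print(ucell_score(cell, genes, ["CD4", "FOXP3"], max_rank=4))  # 0.25
```

In the toy cell, the two CD8 genes hold ranks 1 and 2, so U = 0 and the score is 1. The genes CD4 and FOXP3 hold ranks 5 and 4, giving U = 9 - 3 = 6 and a score of 1 - 6/8 = 0.25.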

Results

UCell is an R package for the evaluation of gene signature enrichment designed for scRNA-seq data. Given a gene expression matrix or Seurat object, and a list of gene sets, UCell calculates signature scores for each cell, for each gene set. In the following illustrative example, we applied UCell to a single-cell multimodal dataset of human blood T cells [11], which were annotated by the authors using both gene (scRNA-seq) and cell surface marker expression (CITE-seq) (Fig. 1A). Given a set of T cell subtype-specific genes (Table 1), UCell helps interpret clusters in terms of signature enrichment in low-dimensional spaces such as the UMAP (Fig. 1B).

Importantly, UCell scores are based on the relative ranking of genes within individual cells, and are therefore robust to dataset composition. Evaluating a CD8 T cell signature on the full dataset or on CD8 T cells only yields identical score distributions for CD8 T cells in the two settings (Fig. 1C). Conversely, AddModuleScore from Seurat normalizes its scores against the average expression of a control set of genes across the whole dataset, and is therefore dependent on dataset composition. CD8 T cells analyzed in isolation or in the context of the full T cell dataset are assigned highly different AddModuleScore scores, with median ~1 in the full dataset and median ~0 for the CD8 T cell subset (Fig. 1D).

Another widely used method for single-cell signature scoring, AUCell [6], is also based on relative rankings and therefore shares with UCell the desirable property of reporting consistent scores regardless of dataset composition. Compared to AUCell, UCell is about three times faster (Fig. 1E) and uses significantly less memory (Fig. 1F). For example, AUCell requires over 64 GB of RAM to process 100,000 single cells, while UCell uses only 5.5 GB of peak memory (Fig. 1F), making it suitable even for machines with limited computing power.
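The dependence on dataset composition can be seen in a small Python sketch under stated assumptions. `rank_score` is a simplified per-cell rank-based score in the spirit of UCell (ties are broken by gene order rather than averaged). `centered_score` is a deliberately simplified stand-in for a dataset-normalized score; it is not Seurat's AddModuleScore, which bins genes by average expression and subtracts the expression of sampled control genes. All names and counts are hypothetical.

```python
def rank_score(cell, signature, max_rank=1500):
    """Per-cell rank-based score; uses only this cell's own expression values."""
    # Rank genes by decreasing expression (ties broken by gene order, a simplification).
    ordered = sorted(cell, key=lambda g: cell[g], reverse=True)
    rank_of = {g: i + 1 for i, g in enumerate(ordered)}
    n = len(signature)
    capped = [min(rank_of[g], max_rank + 1) for g in signature]
    u_stat = sum(capped) - n * (n + 1) / 2.0
    return 1.0 - u_stat / (n * max_rank)


def centered_score(cells, signature):
    """Toy dataset-dependent score: mean signature expression minus the dataset-wide mean."""
    raw = {name: sum(c[g] for g in signature) / len(signature) for name, c in cells.items()}
    dataset_mean = sum(raw.values()) / len(raw)
    return {name: value - dataset_mean for name, value in raw.items()}


cells = {
    "cd8_a": {"CD8A": 6, "CD8B": 4, "CD4": 0, "FOXP3": 0},
    "cd8_b": {"CD8A": 4, "CD8B": 6, "CD4": 0, "FOXP3": 0},
    "cd4_a": {"CD8A": 0, "CD8B": 0, "CD4": 5, "FOXP3": 1},
    "cd4_b": {"CD8A": 0, "CD8B": 0, "CD4": 7, "FOXP3": 3},
}
cd8_only = {name: c for name, c in cells.items() if name.startswith("cd8")}
signature = ["CD8A", "CD8B"]

print(rank_score(cells["cd8_a"], signature, max_rank=3))  # 1.0, whatever other cells are present
print(centered_score(cells, signature)["cd8_a"])          # 2.5 in the full toy dataset
print(centered_score(cd8_only, signature)["cd8_a"])       # 0.0 in the CD8-only subset
```

The rank-based function takes no dataset argument at all, so its output for a given cell cannot change when other cells are added or removed. The centered score for the same cell moves from 2.5 to 0.0 once the CD4 cells are dropped.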
Fig. 1

Evaluating T cell signatures using UCell. A) UMAP representation of T cell subsets from the single-cell dataset by Hao et al. [11]. B) UCell score distribution in UMAP space for five gene signatures (listed in Table 1). C-D) Comparison of UCell score (C) and Seurat’s AddModuleScore (D) distributions for a two-gene CD8 T cell signature (CD8A, CD8B), evaluated on the complete T cell dataset (black outlines), or on the subset of CD8 T cells only (red outlines); UCell scores for CD8 T cells have the same distribution in the complete or subset dataset, while AddModuleScore scores are highly dependent on dataset composition. E-F) Running time (E) and peak memory (F) for UCell and AUCell (which produces similar results) on datasets of different sizes show that UCell is about three times faster and requires up to ten times less memory on large (>10^4 cells) single-cell datasets. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1

Gene signatures for T cell subsets in Fig. 1.

T cell type    Gene set
CD4 T cell     CD4, CD40LG
CD8 T cell     CD8A, CD8B
Treg           FOXP3, IL2RA
MAIT           KLRB1, SLC4A10, NCR3
gd T cell      TRDC, TRGC1, TRGC2, TRDV1

UCell is available as an R package at https://github.com/carmonalab/UCell, and is accompanied by vignettes for signature scoring and for seamless integration with Seurat pipelines. Source code to reproduce the results in this manuscript is available at the following repository: https://gitlab.unil.ch/carmona/UCell_demo.

Funding

This research was supported by the Swiss National Science Foundation (SNF) Ambizione grant 180010 to SJC.

CRediT authorship contribution statement

Massimo Andreatta: Conceptualization, Methodology, Software, Formal analysis, Visualization, Writing - original draft, Writing - review & editing. Santiago J. Carmona: Conceptualization, Methodology, Software, Formal analysis, Writing - original draft, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.