A novel approach to remove the batch effect of single-cell data.

Feng Zhang^1,2,3, Yu Wu^1,2, Weidong Tian^1,2,3,4.

Keywords: Bioinformatics; Transcription

Year: 2019 PMID: 31636959 PMCID: PMC6796914 DOI: 10.1038/s41421-019-0114-x

Source DB: PubMed Journal: Cell Discov ISSN: 2056-5968 Impact factor: 10.849

Dear Editor,

Analyzing single-cell RNA sequencing (scRNA-seq) data from different batches is a challenging task[1]. Commonly used batch-effect removal methods, e.g., Combat[2,3], were initially developed for microarray or bulk RNA-seq data and may not be appropriate for single-cell analysis in some situations[4]. Recently, several batch-effect removal tools specific to single-cell data have been developed. One of them is canonical correlation analysis (CCA) subspace alignment (implemented in Seurat)[4], which conducts CCA and uses dynamic time warping to align the subspaces of different batches. However, CCA may lose the subspaces with the largest variance (which can be identified by PCA), leading to incorrect alignment when the cell types of different batches are extremely imbalanced. To remove batch effects from the PCA subspaces based on correct cell alignment, a method called fastMNN[5] detects mutual nearest neighbors (MNN) of cells in different batches and then uses the MNN to correct the values in each PCA subspace. Although fastMNN was shown to perform well, in practice it has a long running time and lacks explainability because it corrects the values in the PCA subspaces. A graph-based method named batch balanced KNN (BBKNN)[6] reduces batch effects by creating connections between analogous cells in different batches. However, BBKNN only generates the final vectors (UMAP)[7], making it impossible to track the adjustment.

In this study, we present a novel method called batch effect remover (BEER) for combining scRNA-seq data from different batches. The originality of BEER is that it uses the correlation of mutual nearest (MN) cell pairs identified from different batches to identify PCA subspaces with poor correlation (i.e., latent high batch effect), and then removes these subspaces from further analysis.
Because BEER does not change any values in the PCA subspaces, the results produced by BEER are trackable and easily explainable. Using a cell-type imbalanced benchmark, we show that BEER has a clear advantage over four representative batch-effect removal tools: Combat, Seurat (CCA alignment), fastMNN, and BBKNN.

BEER has been implemented in R. The inputs of BEER are two expression matrices (UMI or another unscaled expression format) from two different batches. The row and column names of the two expression matrices are gene and cell names, respectively. The workflow of BEER includes two main parts (Fig. 1a). In the first part, for each expression matrix, BEER preprocesses the data and conducts t-distributed stochastic neighbor embedding (tSNE)[8] to reduce the data to one-dimensional values. tSNE is used for this one-dimensional reduction because of its robustness and well-recognized performance in the field of scRNA-seq analysis[9]. BEER groups cells (the default number of cells in each group is 10) based on the order of the one-dimensional values, and then aggregates the expression profiles of the cells in each group to obtain a representative expression profile for that group. Next, BEER calculates Kendall’s tau to evaluate the distance between each pair of cell groups from the two batches, and identifies all MN pairs of cell groups between the two batches. In the second part, BEER directly combines the two expression matrices, normalizes the data, and conducts PCA to produce a number (default is 50) of subspaces. Because the two cell groups in an MN pair represent the most similar groups in the two batches, they should have similar values in each PCA subspace if there is no batch effect. Thus, by calculating the correlation between MN-paired cell groups in each subspace, BEER identifies the subspaces with poor correlation and considers them to have a latent high batch effect.
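The first part of the workflow described above (ordering cells along a one-dimensional embedding, grouping, aggregating, and detecting MN pairs with Kendall’s tau) can be illustrated with a minimal Python sketch. It is not BEER’s R implementation: the function names are hypothetical, the one-dimensional values are assumed to be already computed (for example by tSNE), and summation is an assumed choice of aggregation, since the text only says that profiles are aggregated.

```python
from itertools import combinations


def group_by_order(one_dim, size=10):
    """Order cells by their one-dimensional value and cut the ordering
    into consecutive groups (the last group may be smaller)."""
    order = sorted(range(len(one_dim)), key=lambda c: one_dim[c])
    return [order[i:i + size] for i in range(0, len(order), size)]


def aggregate(expr, groups):
    """expr[g][c] is the expression of gene g in cell c.
    Returns one profile (list over genes) per group.
    Summation is an assumption made for this sketch."""
    return [[sum(row[c] for c in grp) for row in expr] for grp in groups]


def kendall_tau(x, y):
    """Kendall's tau-b, computed directly over all pairs (quadratic time)."""
    conc = disc = only_x = only_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables
        if dx == 0:
            only_x += 1
        elif dy == 0:
            only_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = ((conc + disc + only_x) * (conc + disc + only_y)) ** 0.5
    return (conc - disc) / denom if denom else 0.0


def mutual_nearest_pairs(profiles_a, profiles_b):
    """Return (i, j) pairs where group i of batch A and group j of batch B
    are each other's nearest group (higher tau means shorter distance)."""
    tau = [[kendall_tau(a, b) for b in profiles_b] for a in profiles_a]
    n_a, n_b = len(profiles_a), len(profiles_b)
    best_b = [max(range(n_b), key=lambda j: tau[i][j]) for i in range(n_a)]
    best_a = [max(range(n_a), key=lambda i: tau[i][j]) for j in range(n_b)]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


# Toy group profiles (4 genes each); values are illustrative only.
profiles_a = [[1, 2, 3, 4], [4, 3, 2, 1]]
profiles_b = [[10, 20, 30, 45], [9, 7, 3, 1], [1, 5, 2, 7]]
mn_pairs = mutual_nearest_pairs(profiles_a, profiles_b)
print(mn_pairs)
```

In this toy input, the first groups of the two batches pair with each other, as do the second groups, while the third group of the second batch has no mutual partner and is left unpaired.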
Finally, BEER simply removes the PCA subspaces with a latent batch effect, and no values in the other subspaces are changed (details are provided in Supplementary information). Note that a removed PCA subspace may also contain biological variance. A workflow has been provided to help users determine whether a PC removed by BEER has biological meaning (see Supplementary information for the workflow); in that case, other methods, such as ComBat, may be used to modify this PC.
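The subspace selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not BEER’s code: `select_pcs` is a hypothetical name, each cell group is assumed to be summarized by one value per PC, and the cutoff `min_tau` is a hypothetical parameter, because the text does not state how "poor correlation" is thresholded (the details are in Supplementary information).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-b, computed directly over all pairs (quadratic time)."""
    conc = disc = only_x = only_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables
        if dx == 0:
            only_x += 1
        elif dy == 0:
            only_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = ((conc + disc + only_x) * (conc + disc + only_y)) ** 0.5
    return (conc - disc) / denom if denom else 0.0


def select_pcs(scores_a, scores_b, pairs, min_tau=0.7):
    """scores_a[g][k] is the value of group g (batch A) on PC k; likewise
    for scores_b. `pairs` lists MN pairs (group in A, group in B).
    Returns the indices of PCs kept; the others are dropped unchanged.
    min_tau is a hypothetical cutoff chosen for this sketch."""
    kept = []
    for k in range(len(scores_a[0])):
        values_a = [scores_a[i][k] for i, _ in pairs]
        values_b = [scores_b[j][k] for _, j in pairs]
        if kendall_tau(values_a, values_b) >= min_tau:
            kept.append(k)
    return kept


# Toy example: PC 0 agrees across MN pairs, PC 1 is anti-correlated.
scores_a = [[0.1, 5.0], [0.4, 1.0], [0.9, 3.0], [1.5, 2.0]]
scores_b = [[0.2, 1.0], [0.5, 4.0], [1.1, 2.0], [1.4, 3.0]]
pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
kept_pcs = select_pcs(scores_a, scores_b, pairs)
print(kept_pcs)
```

Because the retained PCs are passed on without modification, any downstream result can be traced back to specific, unaltered subspaces, which is the property the text refers to as trackability.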

Fig. 1

Workflow and benchmark study of BEER.

a Workflow of BEER. In the “Embed” step, we use tSNE to reduce the single-cell expression matrix to one-dimensional values. When detecting mutual nearest (MN) pairs, we use Kendall’s tau (the “cor.fk” function of the “pcaPP” package in R) to evaluate the distance (a higher Kendall’s tau means a shorter distance). We use “cor.test(method = 'kendall')” in R to test the correlation between MN-paired cell groups. Details are provided in Supplementary information. b Basic information on the benchmark data sets. “Batch1” is derived from a cortex study[10], while “Batch2” is derived from an oligodendrocyte study[11]. The third row shows the number of cell types (or cells, in parentheses) in “Oligodendrocytes”. Details about the two batches are in Supplementary information. c Summary of the methods compared in this study. “C”, “B”, “S”, and “M” stand for “Combat”, “BBKNN”, “Seurat (CCA alignment)”, and “fastMNN”, respectively. “Cell Type Sense” means that the method can sense same-type cells across different batches. “Change Subspace” means that the method changes the values of the PCA (or CCA) subspace. Details about the competing methods are in Supplementary information. d–g UMAP plots showing the output of each method. For plots labeled “Oligodend_batch1” and “Oligodend_batch2”, the red and blue points are oligodendrocytes in batch1 and batch2, respectively. Plots with three labels show the locations of three different cell types that should be separated in UMAP because of their biological differences. High-resolution UMAP plots with all cell-type labels are shown in Supplementary information.

We apply BEER and four other representative batch-effect removal methods (Combat, BBKNN, Seurat CCA alignment, and fastMNN) to a stringent cell-type imbalanced benchmark. In this benchmark, there are two batches: one is from a mouse cortex study[10], and the other is from a mouse oligodendrocyte study[11].
Except for the cell type named “Oligodendrocytes”, the cell types of the two batches are completely different (Fig. 1b and Supplementary information). The total number of cells in this benchmark is 8074. The running time of most methods is about 1–5 min, whereas fastMNN takes 35 min (Fig. 1c). We apply UMAP to visualize the output of each method. As can be seen in Fig. 1d, Combat and Seurat (CCA alignment) fail to mix oligodendrocyte cells from the two batches. Although oligodendrocyte cells from the two batches are mixed by fastMNN and BBKNN, these two methods fail to separate biologically different cell types of different batches (Fig. 1e, f): fastMNN mixes Astrocyte_batch1, OPC_batch2, and Microglia_batch1 together (Fig. 1e), while BBKNN mixes Oligodendrocytes_batch1&batch2, Pyramidal SS_batch1, and Interneurons_batch1 together (Fig. 1f). In contrast, BEER not only mixes the oligodendrocytes of the two batches, but also places the cell types that are not separated by fastMNN and BBKNN in different locations (Fig. 1g), showing a clear advantage over the other methods.

We have examined how BEER’s performance responds to changes in the tSNE perplexity and in the cell group size (used for aggregating expression profiles), and have found that BEER is fairly robust to these changes (Supplementary information). We have also used a quantitative metric, the Silhouette coefficient, to compare the performance of the different methods in removing batch effects, and have demonstrated that BEER clearly outperforms the other methods (Supplementary information). In addition, for batch-effect removal across more than two batches, we have provided a function named “MBEER”, which identifies the batch with the largest number of cells as the target batch and applies BEER iteratively to compare each of the other batches with the target batch (for details, see Supplementary information).
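The target-batch strategy described for “MBEER” can be sketched in Python as below. Only the planning logic stated in the text is shown (the largest batch becomes the target, and every other batch is compared with it); the function name, the batch names, and the cell counts are hypothetical, and the pairwise BEER step itself is not reproduced here.

```python
def plan_mbeer_comparisons(batch_sizes, target=None):
    """batch_sizes maps a batch name to its number of cells.
    If no target is given, the batch with the most cells is used.
    Returns the target and the list of (target, other) comparisons
    to which a pairwise batch-effect check would be applied in turn."""
    if target is None:
        target = max(batch_sizes, key=batch_sizes.get)
    elif target not in batch_sizes:
        raise KeyError(f"unknown target batch: {target}")
    comparisons = [(target, name) for name in batch_sizes if name != target]
    return target, comparisons


# Hypothetical batch names and cell counts, for illustration only.
sizes = {"batchA": 3000, "batchB": 5000, "batchC": 1200}
default_target, default_plan = plan_mbeer_comparisons(sizes)
chosen_target, chosen_plan = plan_mbeer_comparisons(sizes, target="batchC")
print(default_target, default_plan)
print(chosen_target, chosen_plan)
```

The optional `target` argument in this sketch stands in for a user-chosen target batch in place of the largest one.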
Alternatively, users can define the target batch themselves and then apply “MBEER” for batch-effect removal across more than two batches.

In conclusion, BEER has three main features: (a) BEER can mix same-type cells from different batches without losing the identities of different cell types in different batches. (b) All steps of BEER are transparent and trackable. (c) BEER is efficient, and it uses the “parallel” package for multi-threaded processing. A user guide for BEER is provided in Supplementary information. For convenience, BEER and all scripts of this study are available at https://github.com/jumphone/BEER.
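For readers unfamiliar with the Silhouette coefficient used in the quantitative comparison above, the following Python sketch computes its standard definition on a small embedding. How the labels and embeddings were chosen in the benchmark is described only in Supplementary information, so the points and labels here are purely illustrative, and the function name is hypothetical.

```python
from collections import defaultdict
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance.
    For each point: a = mean distance to the other points with its label,
    b = smallest mean distance to the points of any other label,
    s = (b - a) / max(a, b); points alone in their cluster score 0."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("at least two distinct labels are required")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The point itself contributes distance 0, hence len(own) - 1.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for name, other in clusters.items()
            if name != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated clusters in a 2-D embedding.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette(embedding, ["x", "x", "y", "y"])
# The same points with labels that cut across the clusters.
mixed = silhouette(embedding, ["x", "y", "x", "y"])
print(separated, mixed)
```

Values close to 1 indicate that points sit with others sharing their label, whereas values near or below 0 indicate that labels are intermixed in the embedding.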

7 in total

1. Adjusting batch effects in microarray expression data using empirical Bayes methods.

Authors: W Evan Johnson; Cheng Li; Ariel Rabinovic
Journal: Biostatistics Date: 2006-04-21 Impact factor: 5.899

2. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.

Authors: Amit Zeisel; Ana B Muñoz-Manchado; Simone Codeluppi; Peter Lönnerberg; Gioele La Manno; Anna Juréus; Sueli Marques; Hermany Munguba; Liqun He; Christer Betsholtz; Charlotte Rolny; Gonçalo Castelo-Branco; Jens Hjerling-Leffler; Sten Linnarsson
Journal: Science Date: 2015-02-19 Impact factor: 47.728

3. Integrating single-cell transcriptomic data across different conditions, technologies, and species.

Authors: Andrew Butler; Paul Hoffman; Peter Smibert; Efthymia Papalexi; Rahul Satija
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

4. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.

Authors: Laleh Haghverdi; Aaron T L Lun; Michael D Morgan; John C Marioni
Journal: Nat Biotechnol Date: 2018-04-02 Impact factor: 54.908

5. A test metric for assessing single-cell RNA-seq batch correction.

Authors: Maren Büttner; Zhichao Miao; F Alexander Wolf; Sarah A Teichmann; Fabian J Theis
Journal: Nat Methods Date: 2018-12-20 Impact factor: 28.547

Review 6. Challenges in unsupervised clustering of single-cell RNA-seq data.

Authors: Vladimir Yu Kiselev; Tallulah S Andrews; Martin Hemberg
Journal: Nat Rev Genet Date: 2019-05 Impact factor: 53.242

7. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system.

Authors: Sueli Marques; Amit Zeisel; Simone Codeluppi; David van Bruggen; Ana Mendanha Falcão; Lin Xiao; Huiliang Li; Martin Häring; Hannah Hochgerner; Roman A Romanov; Daniel Gyllborg; Ana Muñoz Manchado; Gioele La Manno; Peter Lönnerberg; Elisa M Floriddia; Fatemah Rezayee; Patrik Ernfors; Ernest Arenas; Jens Hjerling-Leffler; Tibor Harkany; William D Richardson; Sten Linnarsson; Gonçalo Castelo-Branco
Journal: Science Date: 2016-06-10 Impact factor: 47.728
