Tongxin Wang, Travis S Johnson, Wei Shao, Zixiao Lu, Bryan R Helm, Jie Zhang, Kun Huang.
Abstract
To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for identifying cell lineages and bona fide transcriptional signals, it is necessary to combine data from multiple experiments. We present BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.
Keywords: Autoencoder; Batch effect; RNA-seq; Single cell; Transfer learning
Year: 2019 PMID: 31405383 PMCID: PMC6691531 DOI: 10.1186/s13059-019-1764-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of BERMUDA for removing batch effects in scRNA-seq data. a The workflow of BERMUDA. Circles and triangles represent cells from Batch 1 and Batch 2, respectively. Different colors represent different cell types. A graph-based clustering algorithm was first applied on each batch individually to detect cell clusters. Then, MetaNeighbor, a method based on Spearman correlation, was used to identify similar clusters between batches. An autoencoder was subsequently trained to perform batch correction on the code of the autoencoder. The code of the autoencoder is a low-dimensional representation of the original data without batch effects and can be used for further analysis. b Training an autoencoder to remove batch effects. The blue solid lines represent training with the cells in Batch 1 and the blue dashed lines represent training with cells in Batch 2. The black dashed lines represent the calculation of losses. The loss function we optimized contains two components: the reconstruction loss between the input and the output of the autoencoder, and the MMD-based transfer loss between the codes of similar clusters.
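The two-component loss described in the Fig. 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The Gaussian kernel, the bandwidth `sigma`, the weight `lam`, the dictionary layout of each batch, and the function names `gaussian_kernel`, `mmd2`, and `combined_loss` are all assumptions made for illustration. The caption states only that the loss combines a reconstruction term with an MMD-based transfer term between the codes of similar clusters. The encoder and decoder are not modeled here, so the sketch takes precomputed inputs, outputs, and codes.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        gaussian_kernel(x, x, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
    )


def combined_loss(batch_a, batch_b, similar_pairs, lam=1.0, sigma=1.0):
    """Reconstruction loss plus lam times the summed MMD over similar clusters.

    Each batch is a dict (hypothetical layout) with keys:
      "x"       input expression matrix, cells by genes
      "x_hat"   autoencoder output, same shape as "x"
      "code"    low-dimensional code, cells by code dimensions
      "cluster" integer cluster label per cell
    similar_pairs lists (cluster in batch_a, cluster in batch_b) tuples.
    """
    # Mean squared error between input and output, summed over both batches.
    recon = sum(np.mean((b["x"] - b["x_hat"]) ** 2) for b in (batch_a, batch_b))

    # Transfer term: MMD between the codes of each pair of similar clusters.
    transfer = 0.0
    for i, j in similar_pairs:
        code_i = batch_a["code"][batch_a["cluster"] == i]
        code_j = batch_b["code"][batch_b["cluster"] == j]
        transfer += mmd2(code_i, code_j, sigma)

    return recon + lam * transfer
```

In this sketch the transfer term is evaluated only for cluster pairs listed in `similar_pairs`, so a cluster with no counterpart in the other batch contributes nothing to it and is not pulled toward an unrelated cluster.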
Datasets used for evaluation of BERMUDA
| Dataset | Batch | Protocol | Number of cells | Cell types (number of cells) |
|---|---|---|---|---|
| 2D Gaussian | Batch1 | Simulation | 2000 | Type1 (855), Type2 (237), Type3 (373), Type4 (535) |
| | Batch2 | Simulation | 2000 | Type1 (379), Type2 (656), Type3 (102), Type4 (863) |
| Splatter | Batch1 | Splatter | 2000 | Type1 (810), Type2 (616), Type3 (388), Type4 (186) |
| | Batch2 | Splatter | 1000 | Type1 (439), Type2 (262), Type3 (201), Type4 (98) |
| Pancreas | Muraro | CEL-Seq2 | 2042 | alpha (812), beta (448), gamma (101), delta (193), epsilon (3), acinar (219), ductal (245), endothelial (21) |
| | Baron | inDrop | 8012 | alpha (2326), beta (2525), gamma (255), delta (601), epsilon (18), acinar (958), ductal (1077), endothelial (252) |
| | Segerstolpe | SMART-seq2 | 2061 | alpha (886), beta (270), gamma (197), delta (114), epsilon (7), acinar (185), ductal (386), endothelial (16) |
| PBMC | PBMC | 10x Chromium | 8381 | |
| | Pan T cell | 10x Chromium | 3555 | |
For consistency in this paper, we refer to each pancreas and PBMC dataset (e.g., the Muraro dataset) as a batch, and to a group of batches investigating a similar biological system (e.g., pancreas) as a dataset. Type1 to Type4 in the 2D Gaussian and Splatter datasets refer to virtual cell types generated by simulation.
Experiments performed for comparing BERMUDA with existing methods
| Dataset | Batch | Experiment name | Cell type removed |
|---|---|---|---|
| 2D Gaussian | | | NA |
| | | | Type1 from … |
| | | removal2 | Type1 from Batch1, Type4 from Batch2 |
| Splatter | | | NA |
| | | | Type1 from … |
| | | removal2 | Type1 from Batch1, Type4 from Batch2 |
| Pancreas | | | NA |
| | | | alpha and beta from … |
| | | | NA |
| | | | alpha and beta from …, alpha and beta from … |
| PBMC | | NA | NA |
Fig. 2 Removing batch effects in simulated scRNA-seq data. a UMAP visualizations of results for Experiment removal2 on the 2D Gaussian dataset, where Type1 from Batch1 and Type4 from Batch2 were removed. BERMUDA_0.85 and BERMUDA_0.90 represent results of BERMUDA with S = 0.85 and 0.90, respectively. b UMAP visualizations of results for Experiment removal2 on the Splatter dataset, where Type1 from Batch1 and Type4 from Batch2 were removed.
Fig. 3 Removing batch effects in scRNA-seq data of pancreas cells. a UMAP visualizations of batch effect removal results for Experiment all on the pancreas dataset. Identified alpha and beta cell subpopulations in the Baron batch are highlighted with dashed circles. b Expression patterns of differentially expressed genes within alpha and beta cells, colored by log-transformed TPM values. Statistical significance of the differential expression analysis is listed in Additional file 1: Tables S1-S2.
Fig. 4 Removing batch effects in scRNA-seq data of PBMCs. a Expression patterns of marker genes of immune cells, colored by log-transformed TPM values. b The UMAP visualization of results produced by BERMUDA, colored by batch. Different cell types were identified by analyzing expression patterns of marker genes and are highlighted by dashed circles. BERMUDA correctly combined the corresponding cell types between different batches while preserving cell types not shared by both batches as separate clusters. c The UMAP visualization of the pan T cell batch. No obvious clustering structure was observed when visualizing the pan T cell batch individually.