Zhe Sun, Li Chen, Hongyi Xin, Yale Jiang, Qianhui Huang, Anthony R Cillo, Tracy Tabib, Jay K Kolls, Tullia C Bruno, Robert Lafyatis, Dario A A Vignali, Kong Chen, Ying Ding, Ming Hu, Wei Chen.
Abstract
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite advances in clustering methods, few are tailored to population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effects among multiple individuals in a unified Bayesian hierarchical model framework. Results from extensive simulation studies and applications of BAMM-SC to in-house experimental scRNA-seq datasets using blood, lung and skin cells from humans or mice demonstrate that BAMM-SC outperformed existing clustering methods, with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals.
Year: 2019 PMID: 30967541 PMCID: PMC6456731 DOI: 10.1038/s41467-019-09639-3
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Sample information of real scRNA-seq datasets and the model structure in BAMM-SC. a UMI counts per cell of three droplet-based scRNA-seq datasets. In the boxplots, the box spans from the first to the third quartile (with the median shown as a line in the middle), and the whiskers extend to 1.5× IQR (interquartile range). b An overall workflow of BAMM-SC
Fig. 2 Boxplots of ARIs for 10 clustering methods across 100 simulations. a Investigating how different levels of heterogeneity among multiple individuals (measured by mean values) affect clustering results. The simulated dataset consists of 10 individuals with 400 cells each. b Investigating how different numbers of individuals affect clustering results. We set the level of heterogeneity (mean of ) among individuals to 0.1. In the boxplots, the box spans from the first to the third quartile (with the median shown as a line in the middle), and the whiskers extend to 1.5× IQR (interquartile range)
Fig. 3 Boxplots of ARI for 10 clustering methods across 100 simulations using Splatter. a Investigating how different levels of group effect affect clustering results. We set the mean parameters of three cell types as (0.20, 0.21, 0.22), (0.20, 0.22, 0.24), and (0.20, 0.24, 0.28) to represent three levels (low, medium, and high) of group difference. b Investigating how different levels of batch effect affect clustering results. We set the mean parameters of the five individuals as (0.1, 0.1, 0.1, 0.1, 0.1), (0.12, 0.12, 0.12, 0.12, 0.12), and (0.14, 0.14, 0.14, 0.14, 0.14) to represent three levels (low, medium, and high) of batch effects. In the boxplots, the box spans from the first to the third quartile (with the median shown as a line in the middle), and the whiskers extend to 1.5× IQR (interquartile range)
Clustering performance across ten analysis runs for three real datasets
| Method | Mean_P | SD_P | Range_P | Mean_L | SD_L | Range_L | Mean_S | SD_S | Range_S |
|---|---|---|---|---|---|---|---|---|---|
| MNN+K-means | 0.379 | 0.083 | (0.283–0.485) | 0.662 | 0.066 | (0.596–0.815) | 0.597 | 0.075 | (0.461–0.676) |
| MNN+TSCAN | 0.373 | NA | NA | 0.720 | NA | NA | 0.553 | NA | NA |
| MNN+SC3 | 0.348 | 0.084 | (0.266–0.511) | 0.640 | 0.061 | (0.556–0.687) | 0.517 | 0.034 | (0.436–0.557) |
| MNN+Seurat | 0.325 | NA | NA | 0.749 | NA | NA | 0.647 | NA | NA |
| CCA+K-means | 0.414 | 0.056 | (0.307–0.464) | 0.695 | 0.114 | (0.505–0.883) | 0.619 | 0.129 | (0.424–0.737) |
| CCA+TSCAN | 0.210 | NA | NA | 0.611 | NA | NA | 0.398 | NA | NA |
| CCA+SC3 | 0.145 | 0.052 | (0.051–0.215) | 0.610 | 0.068 | (0.531–0.708) | 0.369 | 0.071 | (0.277–0.488) |
| CCA+Seurat | 0.468 | NA | NA | 0.729 | NA | NA | 0.702 | NA | NA |
| DIMM-SC | 0.333 | 0.071 | (0.302–0.529) | 0.809 | 0.030 | (0.742–0.868) | 0.715 | 0.045 | (0.671–0.779) |
| BAMM-SC | 0.487 | 0.056 | (0.362–0.532) | 0.882 | 0.042 | (0.764–0.910) | 0.762 | 0.032 | (0.717–0.843) |
Columns Mean_P, SD_P, and Range_P were calculated from the human PBMC dataset. Columns Mean_L, SD_L, and Range_L were calculated from the mouse lung dataset. Columns Mean_S, SD_S, and Range_S were calculated from the human skin dataset.
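As an illustration of how summary columns like those in the table could be produced, the sketch below computes the adjusted Rand index (ARI) between a reference labeling and a clustering result, then summarizes a set of per-run scores as mean, SD, and range. It assumes that the tabulated values are ARIs (the Fig. 4 caption refers to the highest ARI among ten analyses) and that SD means the sample standard deviation; the function names `adjusted_rand_index` and `summarize_runs` are hypothetical and do not come from the BAMM-SC software.

```python
from collections import Counter
from math import comb
from statistics import mean, stdev


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(truth) != len(pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: labelings agree trivially

    # Pairs of cells that share a cluster in both, in truth, and in pred.
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = in_truth * in_pred / total_pairs  # agreement expected by chance
    max_index = (in_truth + in_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (both - expected) / (max_index - expected)


def summarize_runs(scores):
    """Mean, sample SD, and (min, max) range of per-run scores."""
    return {
        "mean": mean(scores),
        "sd": stdev(scores),  # assumption: sample SD (n - 1 denominator)
        "range": (min(scores), max(scores)),
    }
```

Deterministic methods would give the same score on every run, which would leave nothing to summarize; that may be why some rows report NA for SD and range, though the table itself does not say so.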
Fig. 4 The performance of BAMM-SC clustering for three in-house scRNA-seq datasets. The t-SNE projection of cells (colored by the approximated truth and BAMM-SC clustering results) and bar plots of proportions of cell types among all individuals for a human PBMC, b mouse lung, and c human skin tissues, separately. BAMM-SC clustering labels are from the run with the highest ARI among the ten analyses
Fig. 5 Boxplots of ARI for BAMM-SC across 100 simulations, showing the clustering accuracy under different proportions of cells selected for the training set. In the boxplots, the box spans from the first to the third quartile (with the median shown as a line in the middle), and the whiskers extend to 1.5× IQR (interquartile range)
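The boxplot convention used in the captions can be sketched as follows. This is a minimal illustration that assumes the common plotting convention in which each whisker ends at the most extreme data point lying within 1.5× IQR of the box; the quartile interpolation method and the function name `boxplot_stats` are assumptions, and the plotting software used for the figures may compute quartiles slightly differently.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5x IQR rule."""
    # Assumption: "inclusive" interpolation; other quartile methods exist.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }
```

For example, `boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])` places the box between 3.25 and 7.75, ends the whiskers at 1 and 9, and flags 100 as an outlier.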