Chenwei Li, Baolin Liu, Boxi Kang, Zedao Liu, Yedan Liu, Changya Chen, Xianwen Ren, Zemin Zhang.
Abstract
Fast, robust and technology-independent computational methods are needed for supervised cell type annotation of single-cell RNA sequencing data. We present SciBet, a supervised cell type identifier that accurately predicts cell identity for newly sequenced cells with an order-of-magnitude speed advantage. We enable web client deployment of SciBet for rapid local computation without uploading local data to the server. Given the exponential growth in the size of single-cell RNA datasets, this user-friendly and cross-platform tool can be widely useful for single-cell type identification.
Year: 2020 PMID: 32286268 PMCID: PMC7156687 DOI: 10.1038/s41467-020-15523-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of SciBet algorithm.
a Pre-processing the training set by calculating the mean gene expression from the original expression matrix. Here we use marker genes G1, G2, and G3 along with a non-marker gene G4 as examples. b Using the E-test to select cell type-specific genes for downstream classification. Genes with a total entropy difference larger than the predefined threshold are kept. Genes selected by the E-test are used for model training and prediction. c Training the SciBet model by obtaining the parameters of the multinomial model of each cell type. For each cell type, the parameters of all genes sum to 1 and represent the expression probabilities of the genes. d Calculating the likelihood of a test cell under the trained SciBet model and annotating the test cell with the cell type of maximum likelihood. Each cell in the test set is annotated independently.
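Panels a, c and d can be illustrated with a minimal sketch. It is not the SciBet implementation: the function names, the pseudocount used to avoid zero probabilities, and the toy counts are assumptions made for illustration, and the E-test gene selection of panel b is skipped (all four genes are used).

```python
import math
from collections import defaultdict


def train_multinomial(expr, labels, pseudocount=1.0):
    """Fit one multinomial parameter vector per cell type.

    expr: list of per-cell count lists (cells x genes).
    labels: cell type of each training cell.
    pseudocount: assumed smoothing term so that no gene gets probability 0.
    Returns {cell_type: [p_gene, ...]}; each vector sums to 1.
    """
    n_genes = len(expr[0])
    sums = defaultdict(lambda: [0.0] * n_genes)
    n_cells = defaultdict(int)
    for cell, label in zip(expr, labels):
        n_cells[label] += 1
        for g, x in enumerate(cell):
            sums[label][g] += x
    model = {}
    for label, total in sums.items():
        # Panel a: mean expression per gene within the cell type.
        means = [s / n_cells[label] + pseudocount for s in total]
        # Panel c: normalize so the parameters of all genes sum to 1.
        norm = sum(means)
        model[label] = [m / norm for m in means]
    return model


def predict(model, cell):
    """Panel d: return the cell type with the highest multinomial log-likelihood.

    The multinomial coefficient is the same for every cell type, so it is
    dropped; only sum_g x_g * log(p_g) is compared.
    """
    def loglik(probs):
        return sum(x * math.log(p) for x, p in zip(cell, probs))
    return max(model, key=lambda label: loglik(model[label]))


# Toy counts for genes G1..G4 (G4 is expressed equally in both types).
train = [[10, 0, 1, 5], [8, 1, 0, 5], [0, 9, 1, 5], [1, 11, 0, 5]]
types = ["A", "A", "B", "B"]
model = train_multinomial(train, types)
print(predict(model, [7, 0, 0, 4]))
print(predict(model, [0, 6, 1, 4]))
```

Because each test cell is scored against fixed per-type parameter vectors, cells can be annotated independently of one another, as the caption notes.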
Fig. 2 Cross-validation benchmarks.
a Performance of the feature selection methods measured by the accuracy score for n = 14 datasets (each dataset is plotted as an individual point, representing the mean accuracy score across 50 random repeats). Box plot shows the center line for the median, hinges for the interquartile range and whiskers for 1.5 times the interquartile range. b Single-CPU running times of the gene selection process with E-test, F-test and M3Drop (log scale). Solid lines are loess regression fits (span = 2), implemented with the R function geom_smooth. c Performance of the classifiers measured by the accuracy score for n = 14 datasets (each point represents the mean score across 50 repeats). d Performance of the classifiers measured by the balanced accuracy score for n = 14 datasets (each point represents the mean score across 50 repeats). e Single-CPU running times of classification (log scale). Solid lines are loess regression fits (span = 2), implemented with the R function geom_smooth. f Heatmap of the confusion matrix of the cross-validation result on the human PBMC dataset [10], normalized by column (origin label). g Heatmap of the confusion matrix of the cross-validation result on the human pancreatic dataset [11], normalized by column (origin label).
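The metrics named in panels c, d, f and g can be sketched as follows. This assumes the common definition of balanced accuracy as the mean of per-class recall, and assumes that rows of the confusion matrix hold predicted labels while columns hold origin labels; the caption does not spell out either point, and the function names and toy labels are illustrative.

```python
def confusion_counts(true_labels, pred_labels, classes):
    """counts[pred][true]: rows are predicted labels, columns are origin labels."""
    counts = {p: {t: 0 for t in classes} for p in classes}
    for t, p in zip(true_labels, pred_labels):
        counts[p][t] += 1
    return counts


def normalize_columns(counts, classes):
    """Divide each column (origin label) by its total so the column sums to 1."""
    totals = {t: sum(counts[p][t] for p in classes) for t in classes}
    return {
        p: {t: counts[p][t] / totals[t] if totals[t] else 0.0 for t in classes}
        for p in classes
    }


def accuracy(true_labels, pred_labels):
    """Fraction of cells whose predicted label matches the origin label."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)


def balanced_accuracy(true_labels, pred_labels, classes):
    """Mean per-class recall, so rare cell types weigh as much as common ones."""
    counts = confusion_counts(true_labels, pred_labels, classes)
    recalls = []
    for c in classes:
        total = sum(counts[p][c] for p in classes)
        if total:
            recalls.append(counts[c][c] / total)
    return sum(recalls) / len(recalls)


# Toy example: four T cells (all correct) and two B cells (one mislabeled as T).
classes = ["T", "B"]
truth = ["T", "T", "T", "T", "B", "B"]
pred = ["T", "T", "T", "T", "T", "B"]
print(accuracy(truth, pred))
print(balanced_accuracy(truth, pred, classes))
print(normalize_columns(confusion_counts(truth, pred, classes), classes))
```

In this toy case plain accuracy is 5/6 while balanced accuracy is 0.75, which shows why the two scores are reported separately when cell type frequencies are uneven.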
Fig. 3 Applications of SciBet.
a Mean accuracy (across 50 repeats) for n = 6 cross-platform dataset pairs listed in Supplementary Table 2. b Cross-species classification with three human pancreas datasets projected to the Tabula Muris dataset (Sankey diagram). The height of each linkage line reflects the number of cells. c Confusion matrix of the cross-validation result for 30 cell types in the “mock” human cell atlas (listed in Supplementary Table 3). d Single-cell classification for a human liver dataset with an integrated human dataset as reference, implemented by SciBet. e Confusion matrix for the case study of false positive control, normalized by row (origin label). Negative cells, including malignant cells, CAF cells and endothelial cells, were removed from the training set. Query cells with the lowest classification confidence scores were labeled as unassigned. f False positive control evaluation with cell types not present in the reference as negative cells, with n = 10 pairs of datasets (each point represents the mean accuracy score or FPR across 50 repeats). Box plot shows the center line for the median, hinges for the interquartile range and whiskers for 1.5 times the interquartile range. g Expression heatmap of the top 54 genes selected by E-test for the integrated immune dataset (Supplementary Table 6). h 2D-UMAP showing the dimensional reduction result based on the genes in g.
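The false positive control of panels e and f can be illustrated with a short sketch. The caption does not say how the confidence score is computed or where the cutoff lies, so the fixed threshold, the function names and the toy values below are assumptions; the FPR is taken here as the fraction of negative cells (types absent from the reference) that still receive a label.

```python
def assign_with_rejection(labels, confidences, threshold):
    """Keep a predicted label only if its confidence reaches the (assumed) cutoff."""
    return [
        label if conf >= threshold else "unassigned"
        for label, conf in zip(labels, confidences)
    ]


def false_positive_rate(assigned, is_negative):
    """Fraction of negative cells that were assigned a cell type anyway."""
    negatives = [a for a, neg in zip(assigned, is_negative) if neg]
    if not negatives:
        return 0.0
    return sum(a != "unassigned" for a in negatives) / len(negatives)


# Toy query: the last two cells belong to types missing from the reference.
predicted = ["T cell", "B cell", "T cell", "NK"]
confidence = [0.9, 0.8, 0.3, 0.5]
negative = [False, False, True, True]

assigned = assign_with_rejection(predicted, confidence, threshold=0.4)
print(assigned)
print(false_positive_rate(assigned, negative))
```

Raising the threshold lowers the FPR at the cost of leaving more genuine reference cell types unassigned, which is why accuracy and FPR are reported together in panel f.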