Wenhao Tang, François Bertaux, Philipp Thomas, Claire Stefanelli, Malika Saint, Samuel Marguerat, Vahid Shahrezaei.
Abstract
MOTIVATION: Normalization of single-cell RNA-sequencing (scRNA-seq) data is a prerequisite to their interpretation. The marked technical variability, high amounts of missing observations and batch effect typical of scRNA-seq datasets make this task particularly challenging. There is a need for an efficient and unified approach for normalization, imputation and batch effect correction.
Year: 2020 PMID: 31584606 PMCID: PMC7703772 DOI: 10.1093/bioinformatics/btz726
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A binomial model of mRNA capture is consistent with the statistics of raw experimental scRNA-seq data. (a) Cartoon illustration of the bayNorm approach. Only a fraction of the total number of mRNAs present in the cell is captured during scRNA-seq library preparation. This occurs with a global probability called capture efficiency (β). Using cell-specific estimates of β, bayNorm aims at recovering the original number of mRNAs of each gene present in each cell. Comparisons between raw experimental scRNA-seq data from the Klein study (Klein ) and synthetic data obtained using the Binomial_bayNorm (orange), Binomial_Splatter (blue) or Splatter (Zappia ) (green) simulation protocols (see Supplementary Note S2 for details). (b) Variance versus mean expression relationship. (c) Dropout rates versus mean expression relationship (note that Binomial_Splatter and Binomial_bayNorm are on top of each other in this panel). The dotted line shows the function. (d) Distribution of dropout values per gene. (e) Distribution of dropout values per cell. (Color version of this figure is available at Bioinformatics online.)
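The binomial capture idea described in the Fig. 1 caption can be illustrated with a small simulation. The following is a minimal sketch, not the Binomial_bayNorm or Binomial_Splatter protocol of Supplementary Note S2: the function names, the Poisson/gamma choice for the "true" counts and the range of β values are all assumptions made for illustration.

```python
import numpy as np

def simulate_capture(true_counts, beta, rng):
    """Binomially down-sample a genes x cells matrix of true mRNA counts.

    `beta` holds one capture efficiency per cell (illustrative values,
    not estimates from any dataset).
    """
    true_counts = np.asarray(true_counts)
    beta = np.asarray(beta, dtype=float)
    # Broadcast the per-cell efficiency across all genes (rows).
    return rng.binomial(true_counts, beta[np.newaxis, :])

def dropout_rate_per_gene(observed):
    """Fraction of cells in which each gene has zero observed counts."""
    return (np.asarray(observed) == 0).mean(axis=1)

rng = np.random.default_rng(0)
n_genes, n_cells = 200, 500

# Assumed "true" expression: Poisson counts around gamma-distributed gene means.
gene_means = rng.gamma(shape=0.5, scale=20.0, size=n_genes)
true_counts = rng.poisson(gene_means[:, np.newaxis], size=(n_genes, n_cells))

# Assumed cell-specific capture efficiencies between 3% and 10%.
beta = rng.uniform(0.03, 0.10, size=n_cells)

observed = simulate_capture(true_counts, beta, rng)
dropout = dropout_rate_per_gene(observed)
```

Plotting `dropout` against the per-gene mean of `observed` would give a dropout-versus-mean curve of the kind shown in panel (c), under these assumed settings.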
Fig. 2. bayNorm recovers distributions of gene expression observed by smFISH. (a) Stag3 mRNA distribution for cells grown in 2i measured by smFISH or by scRNA-seq and normalized with different methods (from Grün study). ‘Raw’ denotes unnormalized scRNA-seq data. (b) As in (a) for the LMNA gene (from Torre study). Legend as in (a). Smoothing bandwidth is 10 for every method shown in (a and b). (c) Log2 ratio between the means of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (d) As in (c) using 12 genes (from Torre study). (e) Log2 ratio between the CV of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (f) As in (e) using 12 genes (from Torre study). (g) Log2 ratio between the Gini coefficients of scRNA-seq measurements for 18 genes normalized by different methods and their matched smFISH measurements (from Grün study). (h) As in (g) using 12 genes (from Torre study). For the bayNorm and SAVER normalized datasets, 20 or 5 samples were generated from posterior distributions for the Grün and the Torre studies, respectively. All normalized datasets except bayNorm and the Scaling method have been divided by the value used in the bayNorm procedure. For this analysis smFISH data were normalized for variation in total transcript numbers using either cell size measurements (Grün study) or expression levels of a housekeeping gene (Torre study) as detailed in Supplementary Note S4.
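The comparisons in panels (c) to (h) of Fig. 2 are log2 ratios of a summary statistic (mean, CV or Gini coefficient) between normalized scRNA-seq values and smFISH values for the same gene. The sketch below shows one way such ratios could be computed. The helper names are hypothetical, and the specific estimators (sample standard deviation for the CV, the sorted-rank formula for the Gini coefficient) are assumptions rather than the exact choices made in the study.

```python
import numpy as np

def cv(x):
    """Coefficient of variation: sample standard deviation over mean."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

def gini(x):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n

def log2_ratio(stat, scrna, smfish):
    """log2 of stat(scRNA-seq) / stat(smFISH) for a single gene."""
    return np.log2(stat(scrna) / stat(smfish))

# Toy values for one gene (made up for illustration only).
scrna_values = [2, 4, 6]
smfish_values = [1, 2, 3]
mean_ratio = log2_ratio(np.mean, scrna_values, smfish_values)
```

A value of 0 means the normalized scRNA-seq statistic matches smFISH; positive or negative values indicate over- or underestimation, respectively.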
Fig. 3. bayNorm enables robust and sensitive DE analysis. (a) Number of differentially expressed genes between the 1000 cells with the highest and the 1000 cells with the lowest total counts in the Saint study (Saint ). DE genes were called using the MAST package and plotted for six groups of genes with increasing mean expression (1 = low to 6 = high). (b) Log2 fold-change from (a). Inset shows box plots of total count and cell sizes (as measured in Saint ) in the two groups, illustrating lack of strong correlation between scRNA-seq raw total count and cell size. (c) DE analysis using MAST for different normalization methods (Islam study) using a benchmark list of DE genes obtained from matched bulk RNA-seq data (Ye ). (d) DE analysis using data from the Soumillon study (Soumillon ). 20, 50, 80, 100, 200 or 400 cells were selected randomly from two groups of stage-3 differentiated cells at day 0 (D3T0) or day 7 (D3T7). A list of DE genes obtained from matched bulk RNA-seq data was used as a benchmark (1000 genes with the largest magnitude of log fold-change between the D3T0 and D3T7 samples, Ye ). For bayNorm and SAVER, 3D arrays were used.
Fig. 4. Batch effect correction and cell type identification. (a and b) Each color represents a different cell line derived from a different individual. Color shades represent different batches within a line/individual. (c) Differentially expressed genes were called between lines NA19101 and NA19239 as well as between different batches within each line (seven pairs of comparisons in total). FDRs were averaged across the seven pairs. The vertical and horizontal dashed lines represent 0.25 and 0.75 indicative cutoffs, respectively. bayNorm was applied either across batches but within lines [‘bayNorm local (individual)’] or across all cells (‘bayNorm global’) or within each batch [‘bayNorm local (batch)’]. Global gene-specific prior parameter estimation across all cells results in clear clusters of different cell types compared with Scaling normalization using the data from Zeisel . t-SNE plots are shown based on Scaling normalization (d) and bayNorm (e). The clustering performance is quantified by the Jaccard index (the value reported at the top left of each panel). For bayNorm, one sample of the 3D array was used. (Color version of this figure is available at Bioinformatics online.)
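The caption does not state which variant of the Jaccard index was used to score the clusterings. One common choice for comparing a clustering against reference labels is the pair-counting Jaccard index, sketched below as an assumption: it is the number of cell pairs grouped together in both labelings divided by the number of pairs grouped together in at least one. The function name is hypothetical.

```python
from itertools import combinations

def pair_jaccard(labels_true, labels_pred):
    """Pair-counting Jaccard index between two labelings of the same cells.

    This is one possible definition; the study may use a different variant.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    both = 0    # pairs co-clustered in both labelings
    either = 0  # pairs co-clustered in at least one labeling
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        both += same_true and same_pred
        either += same_true or same_pred
    # With no co-clustered pairs at all, the labelings agree trivially.
    return both / either if either else 1.0

# Identical partitions score 1.0 even when the label names differ.
score = pair_jaccard([0, 0, 1, 1], ["b", "b", "a", "a"])
```

The loop is quadratic in the number of cells, which is fine for a sketch but would need a contingency-table formulation for large datasets.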