| Literature DB >> 32034137 |
Martin Jinye Zhang1, Vasilis Ntranos1,2, David Tse3.
Abstract
An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? Here we present a mathematical framework which reveals that, for estimating many important gene properties, the optimal allocation is to sequence at a depth of around one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plug-in estimator, but one developed via empirical Bayes.Entities:
Mesh:
Substances:
Year: 2020 PMID: 32034137 PMCID: PMC7005864 DOI: 10.1038/s41467-020-14482-y
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1Optimal sequencing budget allocation.
a Description of the sequencing budget allocation problem. Consider estimating the underlying gene distribution (top) from the noisy read counts obtained via sequencing (bottom). With a fixed number of reads to be sequenced, deep sequencing of a few cells accurately estimates each individual cell but lacks coverage of the entire distribution (left), whereas a shallow sequencing of many cells covers the entire population but introduces a lot of noise (right). b Optimal tradeoff. The memory T-cell marker gene S100A4 has 41.7k reads in the pbmc_4k dataset. For estimating the underlying gamma distribution , the relative error is plotted as a function of the sequencing depth, where the optimal error is obtained at a depth of one read per cell (orange star) and is two times smaller than that at the current depth of pbmc_4k (red triangle). c Experimental design. To determine the sequencing depth for an experiment, first the relative gene expression level can be obtained via pilot experiments or previous studies (top left). Then the researcher can select a set of genes of interest (i.e., some marker genes highlighted as black dots), of which the smallest relative expression level (MS4A1) defines the reliable detection limit. Finally, the optimal sequencing depth is determined as (top right). The errors under different tradeoffs are visualized as a function of the genes ordered from the most expressed to the least (bottom). The optimal sequencing budget allocation (orange) minimizes the worst-case error over all the genes of interest (left of the red dashed line), whereas both the deeper sequencing (green) and the shallower sequencing (blue) yield worse results.
Fig. 2Empirical quantification of the optimal sequencing depth.
a Simulations of error under different budget allocation. 3-std confidence intervals are provided. The top panel simulates the error for estimating the first principal direction using the plug-in estimator (blue) and the EB estimator (orange), respectively. Three budgets are considered, i.e., = 0.6 k per gene, = 3k per gene, = 15k per gene. The depth (mean reads per cell per gene) ranges from 0.02 to 10. The result indicates that the optimal depth for the EB estimator is the same (~0.1) for all three budgets, validating the theory that the optimal depth is independent of the budget. The cases for the coefficient of variation and the Pearson correlation (bottom) also show similar qualitative behaviors. b Post hoc guidance for reliable estimation. We visualized the top 4k genes of some representative datasets (top), where a triangle residing in the green region means the Pearson correlation of corresponding genes can be reliably estimated (relative error < 10%). For example, we can reliably estimate the first 2k genes for the brain_1k dataset and all 4k genes for the brain_9k dataset. A more comprehensive result is summarized in the bottom table. For example, the first element (mean, 1k) shows that with 1k cells, a gene needs to have at least 0.1 reads per cell for reliably estimating the mean.
Fig. 3EB estimates are consistent between deep and shallow datasets.
a Top: for estimating the coefficient of variation (CV), the plug-in estimates become more inflated as the sequencing depth becomes shallower (from right to left along the x axis), whereas the EB estimates are consistent. 3-std confidence intervals are provided for this panel. Middle: both brain_1k and brain_1.3m are from the mouse brain, and hence each gene should have a similar CV value between the two datasets. This is indeed the case for the EB estimator (right), which is adaptive to different sequencing depths. However, as brain_1k is twice deeper than brain_1.3m, the plug-in estimates are biased that most points are above the 45-degree line (red). Bottom: distribution recovery for the gene GZMA from a dataset that is subsampled to be five times shallower (left). The EB estimator provides a reasonable estimation for both the zero proportion and the tail shape, resulting in a small total variation error (right). b Feature selection and PCA. The task is to first select features (genes) based on CV, and then perform PCA on the selected features. The results on the full data (pbmc_4k) and a subsampled (three times shallower) are compared. EB estimates are more consistent between the full data and the subsampled data for both the CV ranks (top) and the PCA plots (bottom).
Fig. 4Gene module and gene network analysis.
a Top: the EB-estimated Pearson correlation for some marker genes in pbmc_4k are visualized, ordered by different cell populations (top). The clear block-diagonal structure implies that the EB estimator is capable of capturing the gene functional groups. As a comparison, the plug-in estimator also recovers those modules but with a weaker contrast (bottom left panel, plug-in with 100%). Bottom: a subsample experiment further shows that the EB estimator can recover the module with 5% of the data. For the plug-in estimator, the first block (T cells) is blurred with 25% of the data, and the entire structure vanishes with 10% of the data. b Gene network based on the EB-estimated Pearson correlation using the pbmc_4k dataset. Most gene modules correspond to important cell types or functions, including T cells, B cells, NK-cells, myeloid-derived cells, megakaryocytes/platelets, ribosomal protein genes, and mitochondrially encoded protein-coding genes. c Left: the estimated Pearson correlations between all genes and LCK (1st panel) and CD3D (2nd panel), two known T-cell markers. There are three modes for the EB-estimated values, where the positive mode, the zero mode, and the negative mode correspond to genes in the same module, different modules, and irrelevant genes, respectively. The plug-in estimated values are nonetheless much closer to zero even for the truly correlated ones, indicating an artificial shrinkage of the estimated values. Right: two instances where the EB estimates are significantly different from the plug-in estimates. The axes represent read counts, and the color codes the number of cells. Both gene pairs are biologically validated (see Gene network analysis in Methods). See also Supplementary Figs. 11–12 for more examples.
Fig. 5Validation using smFISH data.
a The estimated CV (top) and inactive probability (bottom, ) from the Drop-seq data are compared with the smFISH results. The EB estimates (right) are consistent with the smFISH results while there is a clear inflation for the plug-in estimates. b The sequencing budget tradeoff for estimating CV (top) and inactive probability (bottom, , i.e., estimating the zero proportion at 1 read per cell) for the gene MITF. The relative error is evaluated against the gold standard smFISH result. 3-std confidence intervals are provided.
Comparison of the plug-in estimator and the EB estimator.
| plug-in | EB | |
|---|---|---|
| 1st moment | same | |
| 2nd moment | ||
| 1st pairwise moment | same | |
| Inactively probability | ||
| Pairwise inactive probability | ||
| Distribution | Empirical distribution of |
The two estimators are written in similar forms for better comparison. For the inactive probability (and the pairwise case), is a coefficient that depends on , , and . See Inactive probability in Supplementary Note 5 for the exact expression and other details.