SCOPIT: sample size calculations for single-cell sequencing experiments.

Alexander Davis^1,2, Ruli Gao^1, Nicholas E Navin^3,4.

Abstract

BACKGROUND: In single cell DNA and RNA sequencing experiments, the number of cells to sequence must be decided before running an experiment, and afterwards, it is necessary to decide whether sufficient cells were sampled. These questions can be addressed by calculating the probability of sampling at least a defined number of cells from each subpopulation (cell type or cancer clone).
RESULTS: We developed an interactive web application called SCOPIT (Single-Cell One-sided Probability Interactive Tool), which calculates the required probabilities using a multinomial distribution (www.navinlab.com/SCOPIT). In addition, we created an R package called pmultinom for scripting these calculations.
CONCLUSIONS: Our tool for fast multinomial calculations provides a simple and intuitive procedure for prospectively planning single-cell experiments or retrospectively evaluating whether sufficient numbers of cells have been sequenced. The web application can be accessed at navinlab.com/SCOPIT.

Keywords: Multinomial distributions; Sample size; Single cell sequencing

Year: 2019 PMID: 31718533 PMCID: PMC6852764 DOI: 10.1186/s12859-019-3167-9

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Biological tissues consist of a heterogeneous mixture of cells, including a variety of cell types in normal tissue or subclones in tumor tissue. This heterogeneity can be resolved using single-cell DNA or RNA sequencing methods [1, 2]. Single-cell sequencing studies require sufficiently many cells to be sampled so that normal cell types or cancer subclones of interest (both hereafter referred to as “subpopulations”) are represented in the sample. In most studies, however, the total number of cells is determined arbitrarily by the limits of an instrumentation run, or by budget constraints, which may result in the sampling of too few or too many cells. Here, we have developed an interactive web tool, called SCOPIT (Single-Cell One-sided Probability Interactive Tool), which provides assistance for planning experiments, using calculations from a multinomial distribution.

Implementation

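The first fact used for calculating multinomial probabilities is the well-known equivalence between the probability mass function of a multinomial distribution and conditional probabilities of a Poisson distribution. This equivalence was first noted, to our knowledge, by Fisher [3].

Theorem 1. Assume that $N \sim \mathrm{Multinomial}(n, p)$, where $N$ and $p$ are length-$k$ vectors and $\sum_{i=1}^{k} p_i = 1$. Also assume that $X_i \sim \mathrm{Poisson}(\lambda_i)$ for $i = 1$ to $k$, where $\lambda = \alpha p$ for some $\alpha$. Furthermore, assume that $X_1, \ldots, X_k$ are independent. Then for any event $E$,

$$P(N \in E) = P\left(X \in E \,\middle|\, \sum_{i=1}^{k} X_i = n\right).$$

The second fact is a relationship between conditional Poisson probabilities and an expression involving the sum of truncated Poisson random variables. The following is a slight variant of a theorem due to Levin [4].

Theorem 2. Let $X_i^{(a_i, b_i)}$ be a truncated Poisson random variable, with probability mass function

$$P\left(X_i^{(a_i, b_i)} = x\right) = P(X_i = x \mid a_i < X_i \le b_i),$$

where $X_i$ is a Poisson random variable with rate $\lambda_i$. For vectors $a$ and $b$, let $X^{(a, b)}$ be the vector containing all of these truncated Poisson random variables. Let $E$ be the set of vectors $x$ such that $a < x \le b$. Then,

$$P\left(X \in E \,\middle|\, \sum_{i=1}^{k} X_i = n\right) = \frac{P\left(\sum_{i=1}^{k} X_i^{(a_i, b_i)} = n\right)}{P\left(\sum_{i=1}^{k} X_i = n\right)} \prod_{i=1}^{k} P(a_i < X_i \le b_i).$$

Proof: By Bayes' theorem,

$$P\left(X \in E \,\middle|\, \sum_{i=1}^{k} X_i = n\right) = \frac{P\left(\sum_{i=1}^{k} X_i = n \,\middle|\, X \in E\right) P(X \in E)}{P\left(\sum_{i=1}^{k} X_i = n\right)}.$$

Substituting $P\left(\sum_{i=1}^{k} X_i^{(a_i, b_i)} = n\right)$ for $P\left(\sum_{i=1}^{k} X_i = n \mid X \in E\right)$ and $\prod_{i=1}^{k} P(a_i < X_i \le b_i)$ for $P(X \in E)$ yields the theorem. □

This theorem enables a fast calculation of the multinomial probability. The rate-limiting step is calculation of the probability distribution of $\sum_{i=1}^{k} X_i^{(a_i, b_i)}$. Levin [4] provided two suggestions for computing this probability distribution: the first by convolution of the distributions of each $X_i^{(a_i, b_i)}$, and the second using an Edgeworth expansion of the probability distribution of $\sum_{i=1}^{k} X_i^{(a_i, b_i)}$. We implemented both suggestions, which are used for different values of $n$. For small values of $n$, convolution is performed, using the Fastest Fourier Transform in the West (FFTW) [5]. For large values of $n$, an Edgeworth expansion is used. However, whereas Levin [4] used the first four terms in the expansion, we continue adding terms until the last term added is sufficiently small.

SCOPIT also computes Bayesian posterior probability distributions for the multinomial probabilities.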
The multinomial probabilities described above are a function of the population frequencies. When the true population frequencies are not known, but observed frequencies from a previous experiment are available, SCOPIT computes a posterior distribution for the frequencies. The prior used for the frequencies is Dirichlet(0, …, 0), following Jaynes [6] for an experiment in which the possible outcomes are not known in advance. The resulting posterior is Dirichlet($n_1, \ldots, n_k$), where $n_i$ is the number of cells observed from population $i$. Possible frequency vectors are randomly drawn from this posterior using the R package rBeta2009 [7, 8]. Then, the desired multinomial probability is calculated from each sampled frequency vector, resulting in samples from the posterior distribution of possible multinomial probabilities. A posterior distribution over the number of cells required is calculated in the same way.
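
The Poisson representation in Theorems 1 and 2 can be illustrated with a short Python sketch. This is not the pmultinom source code: the function names are hypothetical, the choice α = n is an assumption, and the sum of truncated Poisson variables is obtained by direct convolution only (no FFT and no Edgeworth expansion), so it is suitable only for small n.

```python
import math


def poisson_pmf(x, rate):
    """Poisson probability mass, evaluated in log space for stability."""
    return math.exp(-rate + x * math.log(rate) - math.lgamma(x + 1))


def multinomial_box_probability(lower, size, probs, upper=None):
    """P(lower[i] < N[i] <= upper[i] for all i), with N ~ Multinomial(size, probs).

    Assumes size >= 1 and that every entry of probs is strictly positive.
    """
    if upper is None:
        upper = [size] * len(probs)
    sum_pmf = [1.0] + [0.0] * size  # pmf of the running sum of truncated variables
    box_mass = 1.0                  # running product of the truncation masses
    for a, b, p in zip(lower, upper, probs):
        rate = size * p             # lambda = alpha * p, assuming alpha = size
        # Counts above `size` cannot occur when the sum equals `size`,
        # so capping the upper bound leaves the final ratio unchanged.
        top = min(b, size)
        weights = [poisson_pmf(x, rate) if x > a else 0.0 for x in range(top + 1)]
        mass = sum(weights)         # P(a < X_i <= top)
        if mass == 0.0:
            return 0.0
        truncated = [w / mass for w in weights]
        # Convolve the running sum with this truncated Poisson pmf.
        new_sum = [0.0] * (size + 1)
        for s, ps in enumerate(sum_pmf):
            if ps == 0.0:
                continue
            for x in range(min(top, size - s) + 1):
                new_sum[s + x] += ps * truncated[x]
        sum_pmf = new_sum
        box_mass *= mass
    total_rate = size * sum(probs)
    # Theorem 2: P(truncated sum = n) * prod P(a_i < X_i <= b_i) / P(sum = n).
    return sum_pmf[size] * box_mass / poisson_pmf(size, total_rate)
```

For example, `multinomial_box_probability([0, 0], 4, [0.5, 0.5])` is the probability that two equally frequent subpopulations each appear at least once among four cells, which equals 1 − 2·0.5⁴ = 0.875.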

Results

Estimating required sample size using the multinomial distribution

We make the simplifying assumption that a successful experiment requires sampling a sufficient number of representatives from each subpopulation of interest in the tissue. Defining $c$ as the required number of representatives from each subpopulation, $N_i$ as the number of cells of subpopulation $i$ which are sampled, and $k$ as the number of subpopulations of interest, the probability of meeting this condition is

$$P(N_1 \ge c, N_2 \ge c, \ldots, N_k \ge c).$$

Assuming that a fixed number of cells are chosen at random from the population, the distribution of $N_1, \ldots, N_k$ is multinomial. To calculate this probability, we created an R implementation of a previously described algorithm [4], described further in the Implementation section. Our implementation is available for R scripting in the package “pmultinom”, available from CRAN (Table 1).
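
To make the definition of this probability concrete, the following Python sketch estimates it by simulation instead of by the exact algorithm. It is only an illustration: the function name, the default number of trials, and the fixed seed are assumptions, and any frequency mass not assigned to a subpopulation of interest is treated as a single "other" category.

```python
import random


def simulate_success_probability(freqs, n_cells, c, trials=20000, seed=0):
    """Monte Carlo estimate of P(N_1 >= c, ..., N_k >= c).

    freqs lists the frequencies of the k subpopulations of interest;
    the remaining mass (if any) is a catch-all category with no requirement.
    """
    rng = random.Random(seed)
    k = len(freqs)
    rest = max(0.0, 1.0 - sum(freqs))
    weights = list(freqs) + [rest]
    labels = range(k + 1)
    successes = 0
    for _ in range(trials):
        counts = [0] * (k + 1)
        # Draw n_cells cells at random, i.e. one multinomial sample.
        for label in rng.choices(labels, weights=weights, k=n_cells):
            counts[label] += 1
        if all(counts[i] >= c for i in range(k)):
            successes += 1
    return successes / trials
```

With two equally frequent subpopulations, four cells and c = 1, the estimate should fall close to the exact value 1 − 2·0.5⁴ = 0.875, up to simulation noise.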

Table 1

Package functions for pmultinom. This table lists the R functions for the package “pmultinom” for calculating multinomial probabilities

Function	Arguments	Description
pmultinom	lower, upper, size, probs, method	Probability that a multinomial random vector is elementwise greater than “lower” and elementwise less than or equal to “upper”. “size” and “probs” specify the parameters of the multinomial distribution. Either “lower” or “upper” may be left unspecified.
invert.pmultinom	lower, upper, probs, target.prob, method	Returns the “size” parameter required for pmultinom to reach the target probability “target.prob”.

Our web tool, SCOPIT, provides an interactive interface for multinomial calculations. SCOPIT provides both prospective and retrospective calculations, described below.

Prospective calculations

SCOPIT’s prospective mode is intended to estimate the number of cells that must be sampled in a single-cell sequencing experiment. Ideally, the number of cells can be decided by finding the smallest number of cells, $n^{*}$, such that the above multinomial probability is at least a specified success probability, $p^{*}$. Such a calculation would require specifying the frequency of each subpopulation of cells in the tissue, but the precise subpopulation frequencies are usually unknown before performing the experiment. The strategy implemented in the prospective mode is to specify the frequency of the rarest subpopulations that the researcher intends to find, as well as $k$, the number of subpopulations with approximately this frequency. Both numbers are relevant, since it is harder to find, for example, 10 subpopulations with frequency 1% than it is to find only one. The required number of cells is defined as follows:

$$n^{*} = \min\{\, n : P(N_1 \ge c, N_2 \ge c, \ldots, N_k \ge c) \ge p^{*} \,\}.$$

SCOPIT reports $n^{*}$ along with a plot of the probability as a function of the number of cells sequenced (Fig. 1a).

Fig. 1

SCOPIT interface. a. Interface for prospective calculations. Orange lines identify the number of cells required and the target probability of detecting a specified number of each subpopulation. b. Interface for retrospective calculations. The number of cells which were sequenced is entered, and is marked on the plot with a dotted green line. In this example, the orange line is far to the left of the dotted green line, suggesting that more cells were sequenced than required to detect these three subpopulations. To quantify confidence in the results, a dotted black line is plotted that shows the lower end of a 95% credible interval for the probability. The plot title states the upper end of a 95% credible interval for the number of cells required.

This mode requires only one subpopulation frequency to be specified: the minimum frequency among all subpopulations of interest. The SCOPIT interface does enable the user to add additional subpopulations with higher frequencies, but the user will find that these additional subpopulations have negligible effects on $n^{*}$, unless they are very close in frequency to the rarest subpopulations. This phenomenon justifies specifying only the lowest frequency.
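
The prospective calculation can be sketched in Python as a search for the smallest number of cells that reaches the target probability. This is an illustrative sketch, not SCOPIT's code: the function names are hypothetical, the probability is evaluated by a plain convolution of Poisson weights (practical only for modest cell numbers), and the search relies on the assumption that the success probability does not decrease as more cells are sampled.

```python
import math


def success_probability(n_cells, freqs, c):
    """P(N_i >= c for every subpopulation of interest), assuming c >= 1."""
    if n_cells < c * len(freqs):
        return 0.0
    rest = 1.0 - sum(freqs)
    categories = [(f, c) for f in freqs]
    if rest > 1e-12:
        categories.append((rest, 0))  # all other cells, no requirement
    # joint[s]: P(independent Poisson counts meet their minimums and sum to s).
    joint = [1.0] + [0.0] * n_cells
    for freq, min_count in categories:
        rate = n_cells * freq
        pmf = [math.exp(-rate + x * math.log(rate) - math.lgamma(x + 1))
               for x in range(n_cells + 1)]
        updated = [0.0] * (n_cells + 1)
        for s, ps in enumerate(joint):
            if ps == 0.0:
                continue
            for x in range(min_count, n_cells - s + 1):
                updated[s + x] += ps * pmf[x]
        joint = updated
    total_rate = n_cells * sum(f for f, _ in categories)
    log_denominator = (-total_rate + n_cells * math.log(total_rate)
                       - math.lgamma(n_cells + 1))
    # Condition on the Poisson total being n_cells (multinomial equivalence).
    return joint[n_cells] / math.exp(log_denominator)


def required_cells(rare_freq, k, c=1, target_prob=0.95):
    """Smallest n with P(N_1 >= c, ..., N_k >= c) >= target_prob."""
    if rare_freq <= 0.0 or rare_freq * k > 1.0 + 1e-12:
        raise ValueError("k subpopulations of this frequency must fit in [0, 1]")
    if c < 1 or not 0.0 < target_prob < 1.0:
        raise ValueError("need c >= 1 and 0 < target_prob < 1")
    freqs = [rare_freq] * k
    lo, hi = 0, c * k               # lo always fails; grow hi until it succeeds
    while success_probability(hi, freqs, c) < target_prob:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:              # bisection between a failing and a passing n
        mid = (lo + hi) // 2
        if success_probability(mid, freqs, c) >= target_prob:
            hi = mid
        else:
            lo = mid
    return hi
```

As a check that can be done by hand, a single subpopulation with frequency 0.1 and c = 1 requires 1 − 0.9ⁿ ≥ 0.95, which first holds at n = 29, and `required_cells(0.1, 1)` returns that value.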

Retrospective calculations

After an experiment has been performed, estimates of the subpopulation frequencies are available as input parameters. It is then possible to use SCOPIT in retrospective mode to estimate how many cells would be required, in a hypothetical replicate experiment, to detect all $k$ observed subpopulations, with $c$ representatives from each. In retrospective mode, the information required from the user consists of the total number of cells sequenced in a previous experiment, and the number of cells observed from each subpopulation. With this information, SCOPIT will calculate, for each number of cells $n$, the probability $P(N_1 \ge c, N_2 \ge c, \ldots, N_k \ge c)$, assuming the true subpopulation frequencies are equal to the empirically observed ones. For example, in Fig. 1b, we use single cell DNA data from a triple-negative breast tumor [9] in which the authors sequenced 84 single cells and detected two major clonal subpopulations. Using SCOPIT we estimated that only 19 cells were required to detect the two subpopulations with a 0.95 probability, suggesting that this study sequenced about 4 times the number of cells that were necessary. Because the retrospective analysis involves uncertainty about the true frequencies of each population, SCOPIT provides measures of uncertainty using 95% Bayesian credible intervals. For the number of cells required, SCOPIT reports the upper end of a one-sided credible interval, which is interpretable as the highest number of cells consistent with the data. For the probability of obtaining a sufficient number of cells from each population, SCOPIT plots the lower end of a one-sided credible interval, interpretable as the lowest probability consistent with the data. In the example described above, the credible interval boundaries were close to the estimated values, indicating that the estimated values were strongly supported by the data provided.
The retrospective tool is useful for planning a second experiment, assuming that all the subpopulations of interest were observed in the first experiment, and that the underlying subpopulation frequencies are consistent in both experiments. Although the exact subpopulation frequencies are not known, overconfident conclusions on the basis of limited information can be avoided using the credible intervals provided by the retrospective tool.
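
The credible-interval idea can be sketched in Python for the simplest case of two subpopulations that together make up the whole sample, where the Dirichlet posterior reduces to a Beta distribution. This is an illustration under stated assumptions, not SCOPIT's implementation: the function names, the number of posterior draws, the seed, and the example counts used below are all hypothetical and are not taken from the cited study.

```python
import math
import random


def two_clone_probability(n_cells, freq1, c):
    """P(N_1 >= c and N_2 >= c) when N_1 ~ Binomial(n_cells, freq1), N_2 = n_cells - N_1."""
    return sum(
        math.comb(n_cells, x) * freq1 ** x * (1.0 - freq1) ** (n_cells - x)
        for x in range(c, n_cells - c + 1)
    )


def posterior_lower_bound(count1, count2, n_cells, c,
                          level=0.95, draws=2000, seed=0):
    """Lower end of a one-sided credible interval for the success probability.

    A Dirichlet(0, 0) prior with observed counts (count1, count2) gives a
    Beta(count1, count2) posterior for the first frequency.
    """
    rng = random.Random(seed)
    sampled = sorted(
        two_clone_probability(n_cells, rng.betavariate(count1, count2), c)
        for _ in range(draws)
    )
    # Empirical (1 - level) quantile of the sampled probabilities.
    return sampled[int((1.0 - level) * draws)]
```

For hypothetical counts of 60 and 24 cells, `posterior_lower_bound(60, 24, n_cells=30, c=5)` gives a bound below the plug-in value `two_clone_probability(30, 60 / 84, 5)`, reflecting the uncertainty in the observed frequencies.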

Comparison with independence approximation

A previous software tool for estimating single cell sample sizes is an unpublished web application (https://satijalab.org/howmanycells). That tool is based upon two simplifying assumptions: that the subpopulations have equal frequencies, and that the observed frequencies of each subpopulation are statistically independent. Under these assumptions:

$$P(N_1 \ge c, N_2 \ge c, \ldots, N_k \ge c) = P(N \ge c)^{k},$$

where $N$ represents the number of cells sampled from an arbitrary subpopulation. To compare the independence approximation method to SCOPIT, the required number of cells was calculated with and without the independence assumption (Table 2). The calculations performed under the independence assumption were highly similar to the exact calculations, underestimating the required number of cells by at most 1 cell. These data suggest that the independence approximation is an alternative approach that can also be used for estimating single cell sample sizes.

Table 2

Comparison of Independent Approximation and Exact Calculations.

Subpopulation frequency	# of subpopulations	Cells required (exact)	Cells required (approx.)
0.1	6	186	186
0.2	3	85	85
0.3	2	53	53
0.1	8	191	191
0.2	4	87	87
0.4	2	39	39
0.1	9	193	193
0.3	3	55	55
0.1	10	195	194
0.2	5	89	89
0.5	2	30	30

The number of cells required to achieve a 95% certainty of sampling sufficiently many cells from each subpopulation. The number of cells was calculated in two ways: by an exact calculation, and by an approximate calculation in which the counts of different subpopulations were assumed to be independent
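
The independence approximation is simple enough to sketch directly in Python. The sketch below assumes that the count for a single subpopulation is binomial and raises its tail probability to the power k; the function names are hypothetical, and the code is not taken from either web tool.

```python
import math


def binomial_tail(n_cells, freq, c):
    """P(N >= c) for N ~ Binomial(n_cells, freq)."""
    below = sum(
        math.comb(n_cells, x) * freq ** x * (1.0 - freq) ** (n_cells - x)
        for x in range(c)
    )
    return 1.0 - below


def required_cells_independent(freq, k, c=1, target_prob=0.95):
    """Smallest n with P(N >= c)^k >= target_prob, treating counts as independent."""
    if not 0.0 < freq <= 1.0 or not 0.0 < target_prob < 1.0 or c < 1:
        raise ValueError("need 0 < freq <= 1, 0 < target_prob < 1 and c >= 1")
    n_cells = c  # fewer than c cells can never give c representatives
    while binomial_tail(n_cells, freq, c) ** k < target_prob:
        n_cells += 1
    return n_cells
```

For two subpopulations with frequency 0.5 and c = 1, the approximation requires (1 − 0.5ⁿ)² ≥ 0.95, which first holds at n = 6. The exact condition 1 − 2·0.5ⁿ ≥ 0.95 also first holds at n = 6, so the two approaches agree in this small case.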

Discussion

SCOPIT’s function is to calculate the number of cells that must be sampled in a single-cell sequencing experiment, on the basis of input subpopulation frequencies, and under the assumption of random sampling. To achieve this goal, we implemented a fast multinomial probability calculation approach that is provided as open access software through the R package ‘pmultinom’. This method enables calculations at speeds sufficient for interactive plotting. The retrospective sample size calculation performed by SCOPIT is distinct from estimation of the number of undiscovered subpopulations [10] or the number likely to be discovered in further sampling [11], and can instead be interpreted as the required sample size of a replicate experiment which would detect the same subpopulations as the original experiment. To determine the number of cells required, SCOPIT calculates the probability of sampling sufficiently many representatives of each subpopulation. The probability calculated by SCOPIT is relevant to a wide variety of analyses and technologies, but specific technologies introduce additional experimental design considerations. For example, in single-cell differential expression analysis, it is important not only to sample sufficiently many cells, but also to sample sufficiently many transcripts from each cell. Other tools have been developed to calculate the probability of detecting a specific transcript [12], to calculate the power to detect differential expression [13], and to determine the number of cells and reads required to find accurate low-dimensional representations of single-cell RNA sequencing data [14]. Accommodating the unique aspects of other technologies and analyses is an important topic for future research in the design of single-cell sequencing experiments. 
A previous tool is available for calculating the number of cells to sequence (https://satijalab.org/howmanycells), and a direct comparison shows that it generates results that are highly similar to those of SCOPIT, despite using an independence approximation instead of exact probabilities. However, SCOPIT offers several additional features, including the ability to enter multiple cell type frequencies, and interfaces to perform both prospective estimates of the sample sizes for planning experiments and retrospective calculations which include measures of confidence in the result. While SCOPIT can be used to decide how many cells to sample from a tissue, another important question is how many spatial regions to sample to capture the diversity of the population. In the case of sampling from tumor tissue, the question of how widely to sample can be addressed by simulating the generation of intratumor heterogeneity [15], followed by simulating sampling. However, simpler statistical calculations which avoid detailed simulations are currently not available and represent an important future direction.

Conclusions

This study reports a tool for performing sample size calculations when planning single cell sequencing experiments, both prospectively and retrospectively. We expect that SCOPIT will have applications in many diverse areas of biology, and for planning experiments on a variety of single cell technologies (scDNA-seq, scRNA-seq and scATAC-seq).

Availability and requirements

Project name: SCOPIT
Project homepage: https://github.com/navinlabcode/scopit
Web interface: http://www.navinlab.com/SCOPIT
Operating system: Platform independent
Programming language: R
License: AGPL v3

5 in total

1. The first five years of single-cell cancer genomics and beyond.

Authors: Nicholas E Navin
Journal: Genome Res Date: 2015-10 Impact factor: 9.043

2. Power analysis of single-cell RNA-sequencing experiments.

Authors: Valentine Svensson; Kedar Nath Natarajan; Lam-Ha Ly; Ricardo J Miragaia; Charlotte Labalette; Iain C Macaulay; Ana Cvejic; Sarah A Teichmann
Journal: Nat Methods Date: 2017-03-06 Impact factor: 28.547

3. Between-region genetic divergence reflects the mode and tempo of tumor evolution.

Authors: Ruping Sun; Zheng Hu; Andrea Sottoriva; Trevor A Graham; Arbel Harpak; Zhicheng Ma; Jared M Fischer; Darryl Shibata; Christina Curtis
Journal: Nat Genet Date: 2017-06-05 Impact factor: 38.330

Review 4. Experimental design for single-cell RNA sequencing.

Authors: Jeanette Baran-Gale; Tamir Chandra; Kristina Kirschner
Journal: Brief Funct Genomics Date: 2018-07-01 Impact factor: 4.241

5. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer.

Authors: Ruli Gao; Alexander Davis; Thomas O McDonald; Emi Sei; Xiuqing Shi; Yong Wang; Pei-Ching Tsai; Anna Casasent; Jill Waters; Hong Zhang; Funda Meric-Bernstam; Franziska Michor; Nicholas E Navin
Journal: Nat Genet Date: 2016-08-15 Impact factor: 38.330
