Literature DB >> 16512900

A simple method for assessing sample sizes in microarray experiments.

Robert Tibshirani1.   

Abstract

BACKGROUND: In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments.
RESULTS: Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various samples sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes.
CONCLUSION: Our method seems to be useful for sample size assessment in microarray experiments.

Entities:  

Mesh:

Year:  2006        PMID: 16512900      PMCID: PMC1450307          DOI: 10.1186/1471-2105-7-106

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Assessment of sample sizes for microarray data is a tricky exercise. The data are complex, as are the biological questions that one might try to answer from such data. What assumptions should one make, and what quantities should be provided as output? There have been a number of recent papers that address this problem. The authors in [2] utilize an ANOVA model and provides power calculations for various alternative models. In [4] a decision-theoretic approach is used and a hierarchical Bayes model. The authors in [8] examine the roles of technical and biological variability, in determining sample size. In [5] it is assumed that the genes are independent and have equal variance, and false discovery rates and sensitivities are reported. The ssize package [7] also assumes that the genes are independent, but uses pilot data to estimate the variance. It focuses on power and type I error. The proposal of [6] assumes independence of genes; the convenient (but unrealistic) case of equal correlation among all genes is also considered. All of these approaches may have shortcomings, namely the assumption of equal variances or independence of genes (or both). These assumptions are often violated in real microarray data and can have a real impact on sample size calculations. We avoid these assumptions in our proposal. We start with the output from a permutation-based analysis for a set of pilot data. From this we estimate the standard deviation of each gene, and the overall null distribution of the genes. Then for a given hypothesized mean difference, we estimate the false discovery rate (FDR) and false negative rate (FNR) of a list of genes. Many authors now favor the FDR over the family-wise error rate (FWER) as the appropriate error measure for microarray studies. The latter is the probability of at least one false positive call, given that we expect many false positive calls among thousands of genes, the FWER does not seem to be as relevant. Since the calculation is based on the gene scores from permutations of the data, the correlation in the genes is accounted for. Use of the permutation distribution avoids parametric assumptions about the distribution of individual genes. And by working with the scores rather than the raw data, we avoid the difficult task of simulating new data from a population having a complicated (and unknown) correlation structure. We provide interpretation of our results both in terms of FDR and FNR, and in terms power and type I error. Our proposal is implemented in the current version of the SAM package [1]. Our main focus is on microarray experiments for determining which genes are differentially expressed across two different experimental conditions, like treatment versus control. However our approach is also applicable to other settings, for example studies that correlate survival time with gene expression. We learned of the proposal in [3] from a referee; it was unknown to us at the time that this paper was written. The resampling-based approach in that paper is very close to the one described here. Some differences are a) by shifting the test-statistics rather than the data, our method is applicable beyond the two-sample problem to general settings like survival data, and b) we report not only false discovery rates but also false negative rates in our assessment of sample sizes.

The proposed method

First we need some definitions. Table 1 summarizes the outcomes of m hypothesis tests on a set of m genes.
Table 1

Possible outcomes from m hypothesis tests of a set of genes. The rows represent the true state of the population and the columns are the result a data-based decision rule.

Called Not SignificantCalled SignificantTotal
NullUVm0
Non-nullTSm1
Totalm – RRm
We have FDR = V/R and FNR = T/(m - R), power = S/m1 and type 1 error = V/m0. For simplicity, for assessing sample sizes we choose our rule so that the number of genes called significant (R) is the same as the number of non-null genes in the population (m1). This implies that 1 - power = FDR and type I error = FNR. Hence conveniently, the FDR can be interpreted as one minus the power per gene, and similarly for the FNR. Here are the details of the calculation for the two-class unpaired case (below we indicate changes necessary for other data types). Let xbe the expression for gene i in sample j; Cis the set of indices for the nsamples in group j, for j = 1 or 2. The two-sample unpaired t-statistic is where Note that this is the gene score used in the SAM method; see the Remark below regarding the exchangeability constant. If σis the true within-group standard deviation for gene i (assumed to be the same for each group), then s2 estimates Hence a shift of δ units in one gene for each sample in group 2 causes an average increase in the score dof (we assume that the proportion of samples in groups 1 and 2 remains the same as we vary the sample size). Starting with some pilot data, this suggests the following procedure for assessing sample sizes: 1. Estimate the null distribution of the scores, and the per gene standard deviation σ, by randomly permuting the class labels and recomputing the gene scores for the permuted data. 2. For k (the number of truly changed genes) running from (say) 10 to m/2, do the following: • Sample a set of m scores from the permutation distribution of the scores • Add in class 2 to a randomly chosen set of k of these scores. • Find the cutpoint c equal to the kth largest score in absolute value • Estimate the FDR and FNR of the rule |d| > c. This is straight forward since we know which genes are truly non-null (they are the ones that were incremented above). 3. Repeat Step 2 B times and report the median result for each k. We also report the 10th and 90th percentiles of the FDR across the B permutations. In our examples we use a relatively small number of repetitions (B = 20); this makes the procedure fast and gives sufficiently accurate estimates. For the two-sample problem, we typically require pilot data with at least 4 or 5 samples per class. The results of this process provide information on how the FDR and FNR will improve if the sample size were to be increased. To get an idea of what values of the mean difference δ are appropriate or reasonable, one can look at the values among the significant genes in the pilot data. This approach can be easily applied to other designs and other types of response parameters. For paired data, we take n1= n2= n/2 (remember n is the total sample size). and all of the above recipe remains the same. For one class data var = /n. For survival data and Cox's proportional hazards model, the analogue of the mean difference between groups is the numerator of the partial likelihood score statistic, which we denote by r. Hence we define the gene-specific variance via the relation var (r) = /n, and we interpret the shift parameter δ relative to r. The units of rare not very interpretable, however, so we use of pilot data as a guide. That is for example, if in our pilot data the genes that we call significant have |r| > 100, we can set δ = 100 in our sample size assessment.

Remark

In the SAM approach, the denominator sin the score (1) is replaced by s+ s0, where s0 is an exchangeability constant. It shrinks the scores of genes with expression near 0 (having s0 ≈ 0).

An example

We generated some pilot data in two classes: there were a total of 1000 genes and 20 samples, with 10 samples in each of class. Each measurement was standard Gaussian (i.e. there was no difference between the groups in the pilot data). We ran a SAM permutation analysis, assuming the data are in a log base 2 scale and specifying a mean difference of Iog2 2 = 1.0. This corresponds to a mean difference of 2 fold for class 1 versus class 2. The results are shown in Figure 1.
Figure 1

Results for simulated data. The genes are generated independently. Each panel shows the estimated FDR and FNR (solid red and green curves) as well as the 10 and 90th percentiles, using the proposed method (remember that in our setup FDR = 1-power and FNR = type I error). A horizontal line is drawn at 0.05. The quantity on the horizontal axis – number of genes – refers to both the hypothesized number of truly non-null genes, and the number of genes called significant. We see that the FDR is probably too high for the pilot data sample size of 20, but improves considerably when the sample size is doubled to 40.

Remember that the quantity on the horizontal axis – number of genes – refers to both the hypothesized number of truly non-null genes, and the number of genes called significant. We see that, depending on the number of genes truly changed at 2-fold, the sample size should be increased to 60 or 100, in order to get the FDR down to 10% or 5%. The false negative rate is consistently low throughout, when n = 60 or 100. Does our approach provide accurate estimates of FDR and FNR? For the setup of the previous example, we estimated FDR and FNR directly from repeated simulations of data from the underlying model. The results are shown in Figure 2.
Figure 2

Results for first simulation study. Here the FDR and FNR are estimated by direct simulation from underlying model.

Note the similarity between Figures 1 and 2. Of course with real data, the second method – generating data from the underlying model – would not be available, since the underlying model is unknown. Figure 3 shows a second example. Here there are 20 samples, and 10 blocks of 100 genes, with genes having pairwise correlation 0.5 in in each block. The mean structure is the same as in the previous example. We see that the FDR and FNR curves are similar to those in Figure 1, but the 10% and 90% curves are much wider. With less certainty in the estimate, it would be advisable to take a large sample size to ensure a reasonably low FDR. This illustrates the importance of preserving the correlation structure of the genes, i.e. it is not safe to make the (unrealistic) assumption of independence between the genes.
Figure 3

Results for second simulated example (correlated genes).

Discussion

We have presented a simple method for assessing sample sizes, that starts with a permutation-based analysis for some pilot data. The method gives reasonably accurate estimates of false discovery rates and false negative rates, as a function of the total number of samples. Our proposal is implemented in the SAM package- the Excel add-in and the R package samr [1].
  4 in total

1.  Power and sample size for DNA microarray studies.

Authors:  Mei-Ling Ting Lee; G A Whitmore
Journal:  Stat Med       Date:  2002-12-15       Impact factor: 2.373

2.  Sample size for gene expression microarray experiments.

Authors:  Chen-An Tsai; Sue-Jane Wang; Dung-Tsa Chen; James J Chen
Journal:  Bioinformatics       Date:  2004-11-25       Impact factor: 6.937

3.  False discovery rate, sensitivity and sample size for microarray studies.

Authors:  Yudi Pawitan; Stefan Michiels; Serge Koscielny; Arief Gusnanto; Alexander Ploner
Journal:  Bioinformatics       Date:  2005-04-19       Impact factor: 6.937

4.  FDR-controlling testing procedures and sample size determination for microarrays.

Authors:  Shuying S Li; Jeannette Bigler; Johanna W Lampe; John D Potter; Ziding Feng
Journal:  Stat Med       Date:  2005-08-15       Impact factor: 2.373

  4 in total
  41 in total

1.  Improving the quality of biomarker discovery research: the right samples and enough of them.

Authors:  Margaret S Pepe; Christopher I Li; Ziding Feng
Journal:  Cancer Epidemiol Biomarkers Prev       Date:  2015-04-02       Impact factor: 4.254

2.  Using RNA sample titrations to assess microarray platform performance and normalization techniques.

Authors:  Richard Shippy; Stephanie Fulmer-Smentek; Roderick V Jensen; Wendell D Jones; Paul K Wolber; Charles D Johnson; P Scott Pine; Cecilie Boysen; Xu Guo; Eugene Chudin; Yongming Andrew Sun; James C Willey; Jean Thierry-Mieg; Danielle Thierry-Mieg; Robert A Setterquist; Mike Wilson; Anne Bergstrom Lucas; Natalia Novoradovskaya; Adam Papallo; Yaron Turpaz; Shawn C Baker; Janet A Warrington; Leming Shi; Damir Herman
Journal:  Nat Biotechnol       Date:  2006-09       Impact factor: 54.908

3.  Practical guidelines for assessing power and false discovery rate for a fixed sample size in microarray experiments.

Authors:  Tiejun Tong; Hongyu Zhao
Journal:  Stat Med       Date:  2008-05-20       Impact factor: 2.373

4.  Computational and analytical framework for small RNA profiling by high-throughput sequencing.

Authors:  Noah Fahlgren; Christopher M Sullivan; Kristin D Kasschau; Elisabeth J Chapman; Jason S Cumbie; Taiowa A Montgomery; Sunny D Gilbert; Mark Dasenko; Tyler W H Backman; Scott A Givan; James C Carrington
Journal:  RNA       Date:  2009-03-23       Impact factor: 4.942

Review 5.  Clinical uses of microarrays in cancer research.

Authors:  Carl Virtanen; James Woodgett
Journal:  Methods Mol Med       Date:  2008

6.  Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research.

Authors:  Leif E Peterson; Matthew A Coleman
Journal:  Int J Approx Reason       Date:  2008-01       Impact factor: 3.816

7.  Assessing the validity and reproducibility of genome-scale predictions.

Authors:  Lauren A Sugden; Michael R Tackett; Yiannis A Savva; William A Thompson; Charles E Lawrence
Journal:  Bioinformatics       Date:  2013-09-17       Impact factor: 6.937

8.  Intertwining threshold settings, biological data and database knowledge to optimize the selection of differentially expressed genes from microarray.

Authors:  Paul Chuchana; Philippe Holzmuller; Frederic Vezilier; David Berthier; Isabelle Chantal; Dany Severac; Jean Loup Lemesre; Gerard Cuny; Philippe Nirdé; Bruno Bucheton
Journal:  PLoS One       Date:  2010-10-20       Impact factor: 3.240

9.  Toxicogenomic analysis of susceptibility to inhaled urban particulate matter in mice with chronic lung inflammation.

Authors:  Errol M Thomson; Andrew Williams; Carole L Yauk; Renaud Vincent
Journal:  Part Fibre Toxicol       Date:  2009-03-11       Impact factor: 9.400

10.  Power and sample size estimation in microarray studies.

Authors:  Wei-Jiun Lin; Huey-Miin Hsueh; James J Chen
Journal:  BMC Bioinformatics       Date:  2010-01-25       Impact factor: 3.169

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.