Wei-Jiun Lin, Huey-Miin Hsueh, James J Chen.
Abstract
BACKGROUND: Before conducting a microarray experiment, one important issue is determining the number of arrays required to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods formulate the problem as finding the minimum sample size necessary to achieve a specified sensitivity (the proportion of truly differentially expressed genes that are detected) on average, at a specified false discovery rate (FDR) level and a specified expected proportion (π1) of truly differentially expressed genes in the array. Unfortunately, the probability of attaining the specified sensitivity under such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method that uses a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes.
Year: 2010 PMID: 20100337 PMCID: PMC2837028 DOI: 10.1186/1471-2105-11-48
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Four possible outcomes when testing m hypotheses.
| True State of Nature | Declared significant | Declared not significant | Total |
|---|---|---|---|
| Null | V | S | V + S |
| Alternative | U | T | U + T |
| Total | R | m − R | m |
V is the number of true null hypotheses that are falsely rejected;
U is the number of true alternative hypotheses that are correctly rejected;
S is the number of true null hypotheses that are correctly not rejected;
T is the number of true alternative hypotheses that are incorrectly not rejected;
R is the total number of null hypotheses rejected among the m tests.
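As an illustrative sketch (not taken from the paper), the counts defined above map onto the two quantities the abstract relies on: sensitivity, the fraction of true alternatives that are rejected, U / (U + T), and the false discovery proportion V / R, whose expectation is the FDR. The function name `error_summary` and the example counts are hypothetical.

```python
def error_summary(V: int, U: int, S: int, T: int) -> dict:
    """Summarize one multiple-testing outcome from the four cell counts."""
    R = V + U                      # total number of rejected hypotheses
    m = V + U + S + T              # total number of hypotheses tested
    # Convention assumed here: the false discovery proportion is 0 when nothing is rejected.
    fdp = V / R if R > 0 else 0.0
    # Sensitivity is undefined when there are no true alternatives.
    sensitivity = U / (U + T) if (U + T) > 0 else float("nan")
    return {"m": m, "R": R, "fdp": fdp, "sensitivity": sensitivity}


# Hypothetical counts: 2,000 genes, 100 of them truly differentially expressed.
summary = error_summary(V=4, U=76, S=1896, T=24)
print(summary)
```

In this made-up outcome, 80 genes are declared significant, 4 of them falsely, so the false discovery proportion is 4/80 and the sensitivity is 76/100.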
Table 2. Average formulation versus 95% probability formulation under the independent model (a).
| π1 | λ0 | Average formulation: n (b) | Average formulation: λ | Average formulation: ϕ | 95% probability formulation: n (c) | 95% probability formulation: λ | 95% probability formulation: ϕ |
|---|---|---|---|---|---|---|---|
| 5% | 60% | 9 | 0.70 | 0.985 | 9 | 0.70 | 0.985 |
| 5% | 70% | 9 | 0.70 | 0.576 | 10 | 0.81 | 0.997 |
| 5% | 80% | 10 | 0.81 | 0.681 | 11 | 0.88 | 0.993 |
| 5% | 90% | 12 | 0.92 | 0.866 | 13 | 0.95 | 0.992 |
| 10% | 60% | 8 | 0.70 | 0.999 | 8 | 0.70 | 0.999 |
| 10% | 70% | 8 | 0.71 | 0.687 | 9 | 0.82 | 1.000 |
| 10% | 80% | 9 | 0.82 | 0.841 | 10 | 0.89 | 1.000 |
| 10% | 90% | 11 | 0.93 | 0.977 | 11 | 0.93 | 0.977 |
| 20% | 60% | 7 | 0.72 | 1.000 | 7 | 0.72 | 1.000 |
| 20% | 70% | 7 | 0.74 | 0.975 | 7 | 0.74 | 0.975 |
| 20% | 80% | 8 | 0.85 | 0.996 | 8 | 0.85 | 0.996 |
| 20% | 90% | 9 | 0.91 | 0.792 | 10 | 0.95 | 1.000 |
a. Estimated sample size n, average sensitivity λ, and probability ϕ for the specified sensitivity λ0 = 60%, 70%, 80%, 90%, under the independent model. The parameters used in the calculation were: m = 2,000, π1 = 5%, 10%, 20%, δ0 = 2, and q* = 0.05.
b. Sample size n is computed by the univariate method from Equation (1) to achieve sensitivity λ0 on average.
c. Sample size n is calculated using the method of Tsai et al. [7] to ensure that the probability ϕ of detecting at least a fraction λ0 of the differentially expressed genes is at least 95%.
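To see why meeting λ0 on average can still leave ϕ low, consider a rough sketch under the independent model. It assumes, purely for illustration, that every truly differentially expressed gene is detected independently with the same probability, and it uses the rounded average sensitivity from the table above as that per-gene probability. The number of detected genes is then binomial, and ϕ is an upper tail probability. The function name `prob_sensitivity_at_least` is hypothetical, and the output only approximates the tabulated values.

```python
from math import ceil, comb


def prob_sensitivity_at_least(m1: int, power: float, lambda0: float) -> float:
    """P(U / m1 >= lambda0) when U ~ Binomial(m1, power)."""
    # Smallest number of detections that reaches lambda0; the small offset
    # guards against floating-point error in lambda0 * m1.
    k_min = ceil(lambda0 * m1 - 1e-9)
    return sum(
        comb(m1, k) * power**k * (1 - power) ** (m1 - k)
        for k in range(k_min, m1 + 1)
    )


# m = 2,000 genes with pi1 = 5% gives 100 truly differentially expressed genes.
m1 = round(2000 * 0.05)

# Assumed per-gene detection probability of 0.70 (rounded value for n = 9).
phi_60 = prob_sensitivity_at_least(m1, power=0.70, lambda0=0.60)
phi_70 = prob_sensitivity_at_least(m1, power=0.70, lambda0=0.70)
print(f"phi for lambda0 = 60%: {phi_60:.3f}")
print(f"phi for lambda0 = 70%: {phi_70:.3f}")
```

With the same average sensitivity of about 0.70, the probability of reaching λ0 = 60% is close to 1, while the probability of reaching λ0 = 70% is only a little over one half. This matches the pattern in the π1 = 5% rows for n = 9 and motivates the 95% probability formulation.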
Table 3. Validation of the theoretical results from Table 2 (a).
| π1 | λ0 | Average formulation, univariate method: n (b) | Univariate: q | Univariate: λ | Univariate: ϕ | 95% probability formulation, binomial method: n (c) | Binomial: q | Binomial: λ | Binomial: ϕ | 95% probability formulation, permutation: n (SD) (d) |
|---|---|---|---|---|---|---|---|---|---|---|
| 5% | 60% | 9 | 0.0505 | 0.69 | 0.937 | 9 | 0.0505 | 0.69 | 0.937 | 11.3(0.453) |
| 5% | 70% | 9 | 0.0505 | 0.69 | 0.497 | 10 | 0.0502 | 0.80 | 0.983 | 12.5(0.507) |
| 5% | 80% | 10 | 0.0502 | 0.80 | 0.506 | 11 | 0.0494 | 0.87 | 0.961 | 14.2(0.485) |
| 5% | 90% | 12 | 0.0492 | 0.91 | 0.730 | 13 | 0.0484 | 0.95 | 0.965 | 17.1(0.568) |
| 10% | 60% | 8 | 0.0490 | 0.71 | 0.997 | 8 | 0.0490 | 0.71 | 0.997 | 9.8(0.361) |
| 10% | 70% | 8 | 0.0490 | 0.71 | 0.589 | 9 | 0.0506 | 0.81 | 1.000 | 10.8(0.368) |
| 10% | 80% | 9 | 0.0506 | 0.81 | 0.688 | 10 | 0.0503 | 0.88 | 0.999 | 12.1(0.291) |
| 10% | 90% | 11 | 0.0497 | 0.93 | 0.921 | 11 | 0.0497 | 0.93 | 0.921 | 14.6(0.491) |
| 20% | 60% | 7 | 0.0498 | 0.73 | 1.000 | 7 | 0.0498 | 0.73 | 1.000 | 8.0(0.089) |
| 20% | 70% | 7 | 0.0498 | 0.73 | 0.901 | 7 | 0.0498 | 0.73 | 0.901 | 9.0(0.045) |
| 20% | 80% | 8 | 0.0491 | 0.84 | 0.966 | 8 | 0.0491 | 0.84 | 0.966 | 10.1(0.224) |
| 20% | 90% | 9 | 0.0501 | 0.90 | 0.627 | 10 | 0.0497 | 0.94 | 0.999 | 12.2(0.384) |
a. Empirical estimates of FDR q, average sensitivity λ, and probability ϕ for the univariate method under the average formulation and for the binomial method under the 95% probability formulation. The parameters used in the calculation were: m = 2,000, δ0 = 2, and q* = 0.05.
b. Sample size n is computed by the univariate method from Equation (1) to achieve sensitivity λ0 on average.
c. Sample size n is calculated using the method of Tsai et al. [7] to ensure sensitivity λ0 with 95% probability.
d. Sample size n (standard deviation) is calculated using the proposed permutation method to ensure sensitivity λ0 with 95% probability, with a pilot study of group size 4 under the independent model.
Table 4. Sample size estimates (standard deviations) for the proposed method and the Tibshirani [10] permutation method under a correlated model with effect size 2 (a).
| π1 | λ0 | Univariate (b) | Pilot study of group size 4: proposed (c) | Pilot study of group size 4: Tibshirani (d) | Pilot study of group size 6: proposed (c) | Pilot study of group size 6: Tibshirani (d) | Entire data of size 62: Shao and Tseng (e) |
|---|---|---|---|---|---|---|---|
| 5% | 60% | 9 | 12.2(2.931) | 20.2(6.529) | 12.7(2.193) | 14.9(3.347) | 9.5 |
| 5% | 70% | 9 | 13.1(2.848) | 21.6(6.209) | 13.4(2.330) | 15.9(3.504) | 10.3 |
| 5% | 80% | 10 | 14.3(3.017) | 23.6(6.399) | 14.4(2.335) | 17.2(3.547) | 11.5 |
| 5% | 90% | 12 | 16.3(2.997) | 27.1(6.303) | 16.1(2.365) | 19.5(3.559) | 13.7 |
| 10% | 60% | 8 | 10.9(2.409) | 15.7(4.664) | 11.5(2.015) | 12.5(2.828) | 8.1 |
| 10% | 70% | 8 | 11.8(2.544) | 16.8(4.858) | 12.1(2.096) | 13.4(2.971) | 8.8 |
| 10% | 80% | 9 | 13.0(2.601) | 18.6(4.809) | 13.0(2.033) | 14.4(2.852) | 9.8 |
| 10% | 90% | 11 | 14.7(2.944) | 21.5(5.250) | 14.6(2.275) | 16.4(3.099) | 11.8 |
| 20% | 60% | 7 | 9.8(2.184) | 12.2(3.608) | 10.3(1.832) | 10.4(2.390) | 6.7 |
| 20% | 70% | 7 | 10.4(2.236) | 12.8(3.675) | 10.7(1.899) | 10.9(2.446) | 7.3 |
| 20% | 80% | 8 | 11.4(2.414) | 14.2(3.709) | 11.6(1.995) | 11.9(2.506) | 8.2 |
| 20% | 90% | 9 | 13.1(2.515) | 16.5(3.902) | 13.0(2.074) | 13.6(2.603) | 9.9 |
a. The sample size estimates are based on 1,000 repetitions using the colon tumor data [14], with 4 and 6 samples from each group as the pilot dataset. The parameters used in the calculation were: m = 2,000, δ0 = 2, and q* = 0.05.
b. The univariate method.
c. The proposed permutation method.
d. The Tibshirani [10] permutation method.
e. The Shao and Tseng [8] model-free method using the entire 62 samples.
Table 5. Empirical estimates of FDR, average sensitivity λ, and probability ϕ from the univariate method and the proposed method, based on the results of Table 4.
| π1 | λ0 | Average formulation: n | Average formulation: FDR | Average formulation: λ | Average formulation: ϕ | 95% probability formulation: n | 95% probability formulation: FDR | 95% probability formulation: λ | 95% probability formulation: ϕ |
|---|---|---|---|---|---|---|---|---|---|
| 5% | 60% | 9 | 0.0412 | 0.65 | 0.661 | 13 | 0.0431 | 0.94 | 0.976 |
| 5% | 70% | 9 | 0.0424 | 0.65 | 0.558 | 14 | 0.0443 | 0.97 | 0.984 |
| 5% | 80% | 10 | 0.0389 | 0.76 | 0.611 | 15 | 0.0458 | 0.99 | 0.993 |
| 5% | 90% | 12 | 0.0460 | 0.91 | 0.743 | 17 | 0.0426 | 1.00 | 0.998 |
| 10% | 60% | 8 | 0.0427 | 0.66 | 0.666 | 11 | 0.0474 | 0.92 | 0.964 |
| 10% | 70% | 8 | 0.0419 | 0.66 | 0.585 | 12 | 0.0478 | 0.96 | 0.973 |
| 10% | 80% | 9 | 0.0431 | 0.78 | 0.666 | 13 | 0.0450 | 0.98 | 0.981 |
| 10% | 90% | 11 | 0.0466 | 0.92 | 0.800 | 15 | 0.0475 | 1.00 | 0.994 |
| 20% | 60% | 7 | 0.0433 | 0.69 | 0.711 | 10 | 0.0447 | 0.94 | 0.975 |
| 20% | 70% | 7 | 0.0448 | 0.69 | 0.634 | 11 | 0.0498 | 0.97 | 0.987 |
| 20% | 80% | 8 | 0.0428 | 0.81 | 0.703 | 12 | 0.0496 | 0.99 | 0.994 |
| 20% | 90% | 9 | 0.0442 | 0.89 | 0.716 | 14 | 0.0488 | 1.00 | 1.000 |
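A quick consistency check, offered as an observation rather than a statement from the paper: the sample sizes in the 95% probability columns above appear to equal the first set of permutation-based mean estimates under the pilot study of group size 4 in Table 4, rounded up to whole arrays. The sketch below verifies this for the π1 = 5% rows; the variable names are illustrative.

```python
from math import ceil

# Mean estimates for pi1 = 5%, copied from Table 4 (pilot study of group size 4,
# first permutation-based column), keyed by the target sensitivity lambda0.
table4_means = {"60%": 12.2, "70%": 13.1, "80%": 14.3, "90%": 16.3}

# Sample sizes n for pi1 = 5% under the 95% probability formulation above.
reported_n = {"60%": 13, "70%": 14, "80%": 15, "90%": 17}

# Arrays come in whole units, so each mean estimate is rounded up.
rounded_up = {lam0: ceil(mean) for lam0, mean in table4_means.items()}
matches = rounded_up == reported_n
print(rounded_up, matches)
```

The same rounding reproduces the π1 = 10% and 20% rows when their Table 4 values are substituted.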
Table 6. Sample size estimates (standard deviations) for the proposed method and the Tibshirani [10] permutation method under a correlated model with effect size 1 (a).
| π1 | λ0 | Univariate (b) | Pilot study of group size 4: proposed (c) | Pilot study of group size 4: Tibshirani (d) | Pilot study of group size 6: proposed (c) | Pilot study of group size 6: Tibshirani (d) | Entire data of size 62: Shao and Tseng (e) |
|---|---|---|---|---|---|---|---|
| 5% | 60% | 26 | 39.4(11.166) | 77.8(22.283) | 40.5(8.743) | 56.8(12.376) | 29.0 |
| 5% | 70% | 29 | 43.0(11.659) | 84.5(24.442) | 43.5(8.913) | 61.1(13.570) | 31.7 |
| 5% | 80% | 33 | 48.7(13.104) | 92.2(23.398) | 47.8(9.138) | 65.4(13.134) | NaN |
| 5% | 90% | 40 | 56.8(13.846) | 106.3(25.373) | 54.3(9.168) | 74.1(13.846) | NaN |
| 10% | 60% | 23 | 34.9(9.140) | 60.9(18.692) | 36.8(8.074) | 48.5(12.212) | 25.0 |
| 10% | 70% | 26 | 38.8(9.821) | 66.2(18.993) | 40.0(8.408) | 52.0(11.819) | 27.8 |
| 10% | 80% | 29 | 43.3(10.399) | 72.5(18.492) | 43.3(8.475) | 55.8(11.662) | 31.4 |
| 10% | 90% | 35 | 50.2(10.649) | 83.7(20.271) | 49.5(8.593) | 64.0(12.485) | NaN |
| 20% | 60% | 19 | 31.1(9.066) | 47.0(14.301) | 32.6(7.572) | 39.8(9.552) | 20.7 |
| 20% | 70% | 22 | 34.4(8.740) | 50.4(14.156) | 35.7(7.816) | 42.6(9.570) | 23.4 |
| 20% | 80% | 25 | 38.6(9.611) | 55.5(15.393) | 39.0(7.766) | 46.6(10.415) | 27 |
| 20% | 90% | 31 | 44.7(9.655) | 63.6(14.919) | 44.5(7.999) | 52.3(10.313) | 32.3 |
a. The sample size estimates are based on 1,000 repetitions using the colon tumor data [14], with 4 and 6 samples from each group as the pilot dataset. The parameters used in the calculation were: m = 2,000, δ0 = 1, and q* = 0.05.
b. The univariate method.
c. The proposed permutation method.
d. The Tibshirani [10] permutation method.
e. The Shao and Tseng [8] model-free method using the entire 62 samples.