Ilari Scheinin, José A Ferreira, Sakari Knuutila, Gerrit A Meijer, Mark A van de Wiel, Bauke Ylstra.
Abstract
BACKGROUND: Determining a suitable sample size is an important step in the planning of microarray experiments. Increasing the number of arrays gives more statistical power, but adds to the total cost of the experiment. Several approaches for sample size determination have been developed for expression array studies, but so far none has been proposed for array comparative genomic hybridization (aCGH).
Year: 2010 PMID: 20565750 PMCID: PMC2911457 DOI: 10.1186/1471-2105-11-331
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Evaluation data sets
| Data Set | Array Type | Probes | Regions | Cancer Type | Groups (Samples) |
|---|---|---|---|---|---|
| Chin | spotted oligo | 26,755 | 223 | breast | ER+ (113) |
| Douglas | BAC | 3,032 | 142 | colorectal | MSI (7) |
| Fridlyand | BAC | 1,877 | 231 | breast | TP53+ (10) |
| Myllykangas | cDNA | 11,342 | 260 | gastric | diffuse (15) |
| Nymark | cDNA | 10,953 | 242 | lung | asbestos-exposed (11) |
| Postma | spotted oligo | 26,755 | 111 | colorectal | good (16) |
| Smeets | BAC | 4,196 | 143 | head and neck | HPV+ (12) |
| Wrage | spotted oligo | 25,549 | 23 | lung | BM+ (13) |
| Simulation 0 | in-situ oligo | 42,331 | 440 | | (15) |
| Simulation 5 | in-situ oligo | 42,331 | 489 | | (15) |
| Simulation 10 | in-situ oligo | 42,331 | 525 | | (15) |
Eight public data sets were collected to evaluate the performance of CGHpower. They represented five different cancer types and BAC, cDNA and oligo-based microarray platforms, with resolutions varying from 2 K to 27 K array elements. The last column contains the distinguishing factor used to divide the data set into two groups, along with the number of arrays in each group. The simulated data sets were generated by introducing artificial aberrations into a set of clinical genetics samples. A total of 11 simulations were generated, and the remaining ones are available at http://www.cangem.org/cghpower/. ER = estrogen receptor, MSI = microsatellite instability, CIN = chromosomal instability, HPV = human papilloma virus, BM = bone marrow metastasis.
Figure 1. Power calculations for evaluation data sets. Average power estimated as a function of sample size for the eight evaluation data sets and three simulations. False discovery rate was fixed at 10%. The horizontal position of each small symbol marks the actual size of the data set that was used to calculate the estimates in each case. Real data sets are shown with solid lines and three of the simulations with dotted lines. Additional simulations are available at http://www.cangem.org/cghpower/.
Figure 2. Diagnostic plots. The goodness-of-fit of the two estimators of G and densities of p-values for three data sets illustrating different scenarios in the performance of CGHpower. The data set of Douglas et al. shows A) a satisfactory goodness-of-fit following from B) a convex p-value density function. Mediocre operation is demonstrated with the data set of Postma et al. C) An inferior fit results from D) a p-value density which shows a slight increase for small values, but is not convex as expected. Nymark et al. represents failed execution. E) The disagreement between the G estimators is slightly more severe and the estimated power curve is a flat line (Figure 1). F) P-values exhibit even less density at low values than would be expected by chance. In such circumstances, it is recommended that data preprocessing be carried out before uploading and only the power calculations part be performed in CGHpower.
Figure 3. Consistency as the pilot size is increased. To evaluate whether power estimates obtained from smaller pilots are in fact representative of larger data sets, the calculations were performed with subsets of the Chin et al. data. Resampling without replacement was used to obtain subsets from 10% to 90% of the original data set. Each resampling was repeated ten times and the results were averaged. The horizontal position of each small symbol marks the size of the subset used to obtain each power estimate.
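The resampling scheme described for Figure 3 can be sketched as follows. This is only an illustration of drawing subsets without replacement at 10% to 90% of a data set, repeating each draw ten times and averaging; it is not the CGHpower implementation. The function name `resample_subsets`, its parameters, and the `estimate` callback (a stand-in for whatever power calculation is applied to a subset) are hypothetical.

```python
import random
from statistics import mean


def resample_subsets(samples, percents, repeats, estimate, seed=0):
    """Average an estimate over repeated subsets drawn without replacement.

    samples  -- list of sample identifiers (hypothetical input)
    percents -- subset sizes as integer percentages of the full data set
    repeats  -- number of resamplings per subset size
    estimate -- placeholder callback mapping a subset to a number; the
                actual power calculation is not reproduced here
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    averaged = {}
    for pct in percents:
        # Subset size for this percentage, at least one sample.
        k = max(1, len(samples) * pct // 100)
        # random.sample draws k distinct items, i.e. without replacement.
        values = [estimate(rng.sample(samples, k)) for _ in range(repeats)]
        averaged[k] = mean(values)
    return averaged


# Usage with made-up sample identifiers and a dummy estimate (subset size).
sample_ids = list(range(100))
curve = resample_subsets(sample_ids, range(10, 100, 10), repeats=10, estimate=len)
```

The returned dictionary maps each subset size to the averaged estimate, which corresponds to one symbol per subset size in a plot of this kind.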