
Is there an alternative to increasing the sample size in microarray studies?

Lev Klebanov; Andrei Yakovlev

Abstract

Our answer to the question posed in the title is negative. This intentionally provocative note discusses the issue of sample size in microarray studies from several angles. We suggest that the current view of microarrays as no more than a screening tool be changed and small sample studies no longer be considered appropriate.


Year:  2007        PMID: 17597934      PMCID: PMC1896058          DOI: 10.6026/97320630001429

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

As is obvious from the recent literature, in the decade since the advent of microarray technology the enthusiasm about this technology has substantially subsided. The titles of papers such as “An Array of Problems” [1] or “Getting the Noise out of Gene Arrays” [2], published in high-profile journals, speak for themselves. The growing number of such publications reflects frustration among biologists who spend too much effort and money pursuing false leads while missing many important findings. Is it the microarray technology, or the way it has been used, that is to blame for the current attitude towards these still-emerging methodologies for the generation and analysis of high-throughput data in genomics and proteomics? Notwithstanding the fact that contemporary microarray technology still calls for substantial improvements in both the quality of measurements and the accuracy of probe set definitions, this powerful technology provides a rich source of multidimensional information on the functioning of the whole genome machinery at the level of transcription. Nonetheless, it is typically employed as a simplistic screening tool with a focus on individual gene profiling.

Unfortunately, even this limited goal cannot be achieved with currently practiced sample sizes, for the following two reasons. First, all multiple testing procedures are very unstable in the presence of correlations between gene expression levels, which is the main factor causing instability of gene lists. By and large, the more liberal the procedure, the more unstable the adjusted p-values. [3,4] This effect is exacerbated in small samples. Therefore, the actual number of false discoveries is not well controlled even if strong control is guaranteed in terms of expected values. Second, follow-up confirmatory studies can handle only Type I errors, while a lack of power is a much more serious problem.
When the sample size is small and the number of tests is large, the power of multiple testing procedures is extremely poor, so we tend to report only unusually strong effects while missing an uncontrollable number of biologically significant findings. It is still quite common in biological publications to report microarray data on a small number of subjects. All papers claiming “consistency” or, conversely, “inconsistency” of the results produced by different microarray platforms draw their conclusions from just 3-6 arrays (subjects) per group. We would like to emphasize that it is not the technical noise that represents the main hurdle; it is the biological variability that calls for larger samples. A recently published report of the MicroArray Quality Control (MAQC) Consortium [5] provides direct measurements of the technical noise specific to the Affymetrix platform in the absence of biological variability. We estimated the standard deviations of log-transformed expression levels associated with all genes (probe sets) from technical replicates produced by the MAQC study. They appear to be symmetrically distributed across genes. The resultant average (over genes), equal to 0.11, is less than one-third of a typical minimal (let alone the mean) standard deviation in the presence of biological variability. A log-additive random noise this small exerts almost no effect on the results of statistical analysis in general, and on estimates of correlation coefficients in gene pairs in particular. This is definitely good, albeit belated, news for the scientific community. However, the inter-subject variability is beyond the control of the manufacturer and can be accounted for only through sampling from a general population of subjects. The use of small samples significantly diminishes the utility of microarray analysis. The following simplistic experiment with real biological data illustrates this point.
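The technical-noise summary just described is straightforward to reproduce in outline. The sketch below uses synthetic log2-scale data in place of the actual MAQC technical replicates: the number of probe sets and the 0.11 noise level come from the text, while the number of replicates and the baseline expression range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MAQC technical replicates: the same RNA sample
# measured several times, so all variation is technical noise.
# n_reps and the 4-12 log2 baseline range are assumed values.
n_genes, n_reps = 12558, 5
baseline = rng.uniform(4.0, 12.0, size=(n_genes, 1))        # log2 expression
log_expr = baseline + rng.normal(0.0, 0.11, size=(n_genes, n_reps))

# Per-gene standard deviation across replicates, then averaged over genes --
# the summary statistic quoted in the text.
per_gene_sd = log_expr.std(axis=1, ddof=1)
avg_sd = per_gene_sd.mean()
print(f"average technical SD over genes: {avg_sd:.3f}")
```

With log-additive noise of 0.11, the averaged sample SD comes out close to 0.1, roughly a third of even the smallest biological standard deviations cited in the text.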
We used microarray data reporting expression levels of 12558 probe sets in two groups of patients with different types of childhood leukemia (Hyperdip and TALL), identified through the St. Jude Children's Research Hospital Database. [6] There were 88 and 45 subjects in these groups, respectively. To mimic a small-sample setting, five hundred (B=500) pseudo-independent subsamples of n=5 arrays were drawn repeatedly without replacement from each group. Differentially expressed genes were selected from each pair of subsamples by a t-test with Bonferroni adjustment at a nominal family-wise error rate of 0.05. This experiment resulted in a mean number of rejections equal to 3.63, while the standard deviation was 7.69, indicating a very high variability in the results of testing.

To remedy the situation, investigators resort to bioinformatics tools that utilize prior biological knowledge, such as partially known pathways, for the prioritization of candidate genes. However, current biological knowledge is still limited and inaccurate. Such validation serves as a reasonable underpinning for statistical inference (limited to Type I errors), but it is not a rigorous method. When the biologist has some preliminary idea of which specific set of genes to look at, the significance analysis becomes confirmatory, thereby dramatically reducing the magnitude of the multiple testing problem. This is the basic idea behind Gene Set Enrichment Analysis (see [7] and references therein). However, this approach is limited to pre-defined gene sets and does not offer an alternative to the much-needed exploratory tools. The problem of differential expression is neither unique nor the most important one in the analysis of microarrays. The magnitude of differential expression does not necessarily indicate biological significance, so the price of non-discoveries is difficult to assess.
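The subsampling experiment described above can be sketched as follows. Since the St. Jude arrays are not bundled here, the sketch runs on synthetic data: the group sizes (88 and 45), B=500, n=5, and the 0.05 family-wise error level follow the text, while the number of genes, the set of shifted genes, and the effect size are invented, so the output will not reproduce the reported mean of 3.63 and SD of 7.69.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for the two leukemia groups (subjects x genes).
n_genes = 2000                          # reduced from 12558 to keep this fast
group_a = rng.normal(size=(88, n_genes))
group_b = rng.normal(size=(45, n_genes))
group_b[:, :50] += 4.0                  # pretend 50 genes are truly shifted

def bonferroni_rejections(x, y, fwer=0.05):
    """Two-sample t-test per gene; count Bonferroni-adjusted rejections."""
    _, p = stats.ttest_ind(x, y, axis=0)
    return int(np.sum(p < fwer / x.shape[1]))

# Draw B pairs of subsamples of n arrays each, without replacement,
# and record how many genes each pair declares significant.
B, n = 500, 5
counts = np.array([
    bonferroni_rejections(
        group_a[rng.choice(88, size=n, replace=False)],
        group_b[rng.choice(45, size=n, replace=False)])
    for _ in range(B)
])
print(f"mean rejections: {counts.mean():.2f}, SD: {counts.std(ddof=1):.2f}")
```

Even with a strong, known signal planted in the data, the number of genes declared significant swings from one subsample pair to the next; the qualitative instability, not the exact numbers, is the point.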
By limiting the use of microarrays to screening purposes, we do not unveil the true potential of this resourceful technology. It is imperative for statisticians to be able to extract more information from microarray data in order to prioritize genes in a more meaningful way. In particular, additional information can be provided by gene pairs rather than individual genes. It is noteworthy that recent years have seen a growing interest in correlations between gene expression levels in statistical methodologies for microarray analysis ([8-11] and many others). We suggest that the focus of future efforts be switched to the formation of a vector of different attributes that can be assigned to each gene in order to provide more information for gene prioritization, beyond changes in the marginal distribution across phenotypes. The components of this vector might be adjusted p-values resulting from various statistical tests, the prevalence of a specific type of correlation with other genes, relevance to known pathways, etc. This will allow the investigator to initially enlarge the target set of genes by including more biologically meaningful features, and then to narrow it down by putting such pieces of information together and generating a final output in an automated fashion. Such an endeavor is feasible only if larger samples become more readily available. Statisticians have never insisted vigorously enough on increasing the sample size. Instead, many attempts have been made to overcome sample size and cost limitations by means of mathematics. Such methodological endeavors invariably resort to the idea of pooling information on gene expression across genes. While some of them are quite elegant, it has become clear that the actual correlation structure of microarray data is a barrier to their real-world application, and this barrier seems to be insurmountable at this point in time.
We have contributed to the discussion of this issue with several publications [12-14], providing evidence that the variability of the results of testing based on such methods may be extremely high. This variability manifests itself in the number of rejected hypotheses and in the estimated values of the false discovery rate. As a consequence, one may declare 1500 genes differentially expressed while there are none. [13] It is correlations between gene expression signals that cause this kind of instability, because they are not only strong but also long-ranged, involving thousands and sometimes tens of thousands of genes that form pairs with each particular gene. [10,15] Long-range strong correlation prevails in a huge proportion of randomly selected genes. Pooling strategies such as the empirical Bayes method may work for cluster-dependent data [16], but not in the presence of long-range dependencies. Unfortunately, there is no theoretical way to justify the required minimal sample size. We share the opinion of Yang and Speed [17] that power calculations are of little utility in microarray studies. The main point is that microarray analysis is exploratory (not confirmatory!) by nature, and the most essential ingredients of standard power calculations (such as preliminary information on the expected effect sizes, variability, and the number of affected genes) are absent. [17] It is our strong conviction that small sample sizes in microarray studies are a serious handicap to the progress of modern genomics. However trivial the above statement may sound, its importance remains underappreciated by practitioners. We have expressed this concern before in connection with the MAQC study. [18] At the same time, there is a growing understanding of the importance of replication in microarray experiments, and many large databases are being created in different areas of biomedical research. We are now facing a new era in this field of data analysis: an era of large data sets.
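A small simulation makes the mechanism concrete. In the sketch below, a single latent factor shared by every gene induces strong long-range correlation; both groups are drawn from the same distribution, so every rejection is a false discovery. A standard Benjamini-Hochberg step-up procedure stands in for the FDR-based methods discussed in the text, and the dimensions and factor strength are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def bh_rejections(p, q=0.05):
    """Number of rejections of the Benjamini-Hochberg step-up at level q."""
    m = len(p)
    p_sorted = np.sort(p)
    below = p_sorted <= q * np.arange(1, m + 1) / m
    return int(np.nonzero(below)[0].max() + 1) if below.any() else 0

n_genes, n_per_group, n_runs = 1000, 10, 200
counts = []
for _ in range(n_runs):
    # One latent factor per subject, shared by all genes, makes expression
    # levels strongly correlated across genes; both groups are null.
    x = rng.normal(size=(n_per_group, 1)) + 0.25 * rng.normal(size=(n_per_group, n_genes))
    y = rng.normal(size=(n_per_group, 1)) + 0.25 * rng.normal(size=(n_per_group, n_genes))
    _, p = stats.ttest_ind(x, y, axis=0)
    counts.append(bh_rejections(p))
counts = np.asarray(counts)
print(f"runs with zero rejections: {(counts == 0).mean():.0%}, "
      f"largest rejection count: {counts.max()}")
```

The simulation is constructed so that most runs reject nothing, but an unlucky draw of the latent factor shifts every gene at once, and then a large share of the null genes can be declared significant in a single run.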
The future of microarray technology hinges on our ability to respond to this challenge.
References (18 in total; first 10 shown)

Review 1.  Design issues for cDNA microarray experiments.

Authors:  Yee Hwa Yang; Terry Speed
Journal:  Nat Rev Genet       Date:  2002-08       Impact factor: 53.242

2.  Getting the noise out of gene arrays.

Authors:  Eliot Marshall
Journal:  Science       Date:  2004-10-22       Impact factor: 47.728

3.  An array of problems.

Authors:  Simon Frantz
Journal:  Nat Rev Drug Discov       Date:  2005-05       Impact factor: 84.694

Review 4.  Utility of correlation measures in analysis of gene expression.

Authors:  Anthony Almudevar; Lev B Klebanov; Xing Qiu; Peter Salzman; Andrei Y Yakovlev
Journal:  NeuroRx       Date:  2006-07

5.  Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes.

Authors:  Xing Qiu; Lev Klebanov; Andrei Yakovlev
Journal:  Stat Appl Genet Mol Biol       Date:  2005-11-22

6.  Statistical methods and microarray data.

Authors:  Lev Klebanov; Xing Qiu; Stephen Welle; Andrei Yakovlev
Journal:  Nat Biotechnol       Date:  2007-01       Impact factor: 54.908

7.  The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.

Authors:  Leming Shi; Laura H Reid; Wendell D Jones; Richard Shippy; Janet A Warrington; Shawn C Baker; Patrick J Collins; Francoise de Longueville; Ernest S Kawasaki; Kathleen Y Lee; Yuling Luo; Yongming Andrew Sun; James C Willey; Robert A Setterquist; Gavin M Fischer; Weida Tong; Yvonne P Dragan; David J Dix; Felix W Frueh; Frederico M Goodsaid; Damir Herman; Roderick V Jensen; Charles D Johnson; Edward K Lobenhofer; Raj K Puri; Uwe Scherf; Jean Thierry-Mieg; Charles Wang; Mike Wilson; Paul K Wolber; Lu Zhang; Shashi Amur; Wenjun Bao; Catalin C Barbacioru; Anne Bergstrom Lucas; Vincent Bertholet; Cecilie Boysen; Bud Bromley; Donna Brown; Alan Brunner; Roger Canales; Xiaoxi Megan Cao; Thomas A Cebula; James J Chen; Jing Cheng; Tzu-Ming Chu; Eugene Chudin; John Corson; J Christopher Corton; Lisa J Croner; Christopher Davies; Timothy S Davison; Glenda Delenstarr; Xutao Deng; David Dorris; Aron C Eklund; Xiao-hui Fan; Hong Fang; Stephanie Fulmer-Smentek; James C Fuscoe; Kathryn Gallagher; Weigong Ge; Lei Guo; Xu Guo; Janet Hager; Paul K Haje; Jing Han; Tao Han; Heather C Harbottle; Stephen C Harris; Eli Hatchwell; Craig A Hauser; Susan Hester; Huixiao Hong; Patrick Hurban; Scott A Jackson; Hanlee Ji; Charles R Knight; Winston P Kuo; J Eugene LeClerc; Shawn Levy; Quan-Zhen Li; Chunmei Liu; Ying Liu; Michael J Lombardi; Yunqing Ma; Scott R Magnuson; Botoul Maqsodi; Tim McDaniel; Nan Mei; Ola Myklebost; Baitang Ning; Natalia Novoradovskaya; Michael S Orr; Terry W Osborn; Adam Papallo; Tucker A Patterson; Roger G Perkins; Elizabeth H Peters; Ron Peterson; Kenneth L Philips; P Scott Pine; Lajos Pusztai; Feng Qian; Hongzu Ren; Mitch Rosen; Barry A Rosenzweig; Raymond R Samaha; Mark Schena; Gary P Schroth; Svetlana Shchegrova; Dave D Smith; Frank Staedtler; Zhenqiang Su; Hongmei Sun; Zoltan Szallasi; Zivana Tezak; Danielle Thierry-Mieg; Karol L Thompson; Irina Tikhonova; Yaron Turpaz; Beena Vallanat; Christophe Van; Stephen J Walker; Sue Jane Wang; Yonghong Wang; Russ Wolfinger; Alex Wong; Jie Wu; Chunlin Xiao; Qian Xie; Jun Xu; Wen Yang; Liang Zhang; Sheng Zhong; Yaping Zong; William Slikker
Journal:  Nat Biotechnol       Date:  2006-09       Impact factor: 54.908

8.  Some comments on instability of false discovery rate estimation.

Authors:  Xing Qiu; Andrei Yakovlev
Journal:  J Bioinform Comput Biol       Date:  2006-10       Impact factor: 1.122

9.  Empirical Bayes screening of many p-values with applications to microarray studies.

Authors:  Susmita Datta; Somnath Datta
Journal:  Bioinformatics       Date:  2005-02-02       Impact factor: 6.937

10.  The effects of normalization on the correlation structure of microarray data.

Authors:  Xing Qiu; Andrew I Brooks; Lev Klebanov; Andrei Yakovlev
Journal:  BMC Bioinformatics       Date:  2005-05-16       Impact factor: 3.169

Cited by (7 in total)

Review 1.  Cardiovascular genomics: a biomarker identification pipeline.

Authors:  John H Phan; Chang F Quo; May Dongmei Wang
Journal:  IEEE Trans Inf Technol Biomed       Date:  2012-05-16

2.  Analysis of DNA microarray expression data.

Authors:  Richard Simon
Journal:  Best Pract Res Clin Haematol       Date:  2009-06       Impact factor: 3.020

3.  Skeletal muscle gene expression in response to resistance exercise: sex specific regulation.

Authors:  Dongmei Liu; Maureen A Sartor; Gustavo A Nader; Laurie Gutmann; Mary K Treutelaar; Emidio E Pistilli; Heidi B Iglayreger; Charles F Burant; Eric P Hoffman; Paul M Gordon
Journal:  BMC Genomics       Date:  2010-11-24       Impact factor: 3.969

4.  Exploiting dependencies of pairwise comparison outcomes to predict patterns of gene response.

Authors:  Nam S Vo; Vinhthuy Phan
Journal:  BMC Bioinformatics       Date:  2014-10-21       Impact factor: 3.169

5.  Effects of sample size on differential gene expression, rank order and prediction accuracy of a gene signature.

Authors:  Cynthia Stretch; Sheehan Khan; Nasimeh Asgarian; Roman Eisner; Saman Vaisipour; Sambasivarao Damaraju; Kathryn Graham; Oliver F Bathe; Helen Steed; Russell Greiner; Vickie E Baracos
Journal:  PLoS One       Date:  2013-06-03       Impact factor: 3.240

6.  How high is the level of technical noise in microarray data?

Authors:  Lev Klebanov; Andrei Yakovlev
Journal:  Biol Direct       Date:  2007-04-11       Impact factor: 4.540

7.  Modeling the metabolic interplay between a parasitic worm and its bacterial endosymbiont allows the identification of novel drug targets.

Authors:  David M Curran; Alexandra Grote; Nirvana Nursimulu; Adam Geber; Dennis Voronin; Drew R Jones; Elodie Ghedin; John Parkinson
Journal:  Elife       Date:  2020-08-11       Impact factor: 8.140

