
ESPRESSO: taking into account assessment errors on outcome and exposures in power analysis for association studies.

Amadou Gaye¹, Thomas W Y Burton², Paul R Burton¹.

Abstract

MOTIVATION: Very large studies are required to provide sufficiently big sample sizes for adequately powered association analyses. This can be an expensive undertaking and it is important that an accurate sample size is identified. For more realistic sample size calculation and power analysis, the impact of unmeasured aetiological determinants and the quality of measurement of both outcome and explanatory variables should be taken into account. Conventional methods to analyse power use closed-form solutions that are not flexible enough to cater for all of these elements easily. They often result in a potentially substantial overestimation of the actual power.
RESULTS: In this article, we describe the Estimating Sample-size and Power in R by Exploring Simulated Study Outcomes (ESPRESSO) tool, which allows assessment errors to be incorporated in power calculations under various biomedical scenarios. We also report a real-world analysis in which we used this tool to answer an important strategic question for an existing cohort.
AVAILABILITY AND IMPLEMENTATION: The software is available for online calculation and download at http://espresso-research.org. The code is freely available at https://github.com/ESPRESSO-research.
CONTACT: louqman@gmail.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2015 PMID: 25908791 PMCID: PMC4528636 DOI: 10.1093/bioinformatics/btv219

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

A critical question to answer when designing or extending large scale studies and biobanks is what sample size is required to achieve adequate statistical power (i.e. the ability to detect the true determinants of the outcome). For a large study aimed at exploring weak effects, the answer can have major implications for funding and resources because the number of participants to be recruited (sample size) depends critically on the desired level of power; the higher the desired power, the larger the required sample size. Among other things, the answer also depends on the quality of the measurements of the variables of interest since assessment errors, in both outcome and explanatory variables, can substantially reduce the power of association studies (Wong ). Failure to take account of such assessment errors in power analyses at the design stage of a study may lead to a serious over-estimation of its true statistical power and result in a research platform that is critically underpowered when it comes to analysis. Most conventional approaches to estimating the sample size required to achieve adequate statistical power fail to account rigorously for the sensitivity and specificity of the assessment of categorical outcome and explanatory variables or the reliability of quantitative variables (Burton ). Furthermore, proper incorporation of the impact of assessment error in multiple variables, and of the analytic disturbance that may arise from variables that have not been measured at all, generally demands non-trivial extension of the methods being used and so is not attempted. But given the magnitude of the sample sizes that may well be required, the vast investment of time and resources that is needed, and the scientific and financial costs associated with significantly misjudging required study size, the failure to properly account for these factors can seriously undermine strategic planning. These effects may be substantial. 
For example, having made entirely reasonable assumptions about issues such as the measurement error in outcome and explanatory variables and heterogeneity in disease risk, it has been shown that, for a study with an intended statistical power of 80%, the required sample size can easily more than double (Burton ). But, if a study designed to have a power of 80% to detect a given effect at a (two-tailed) P-value of 0.01 is only half its required size, it is straightforward to calculate that the actual power of that study will be given by the cumulative standard normal distribution up to the quantile $z = \frac{2.576 + 0.842}{\sqrt{2}} - 2.576$, where 2.576 is the standard normal deviate corresponding to a two-tailed P-value of 0.01, 0.842 is the deviate corresponding to a power of 80%, and $\sqrt{2}$ reflects the shrinkage in the standard error if sample size doubles. This quantile (−0.159) corresponds to a real power of 43.6% (rather than the intended 80%). Equivalent calculations for candidate gene studies using a (two-tailed) P-value of P < 10⁻⁴ or genome wide association studies using P < 10⁻⁸ generate corresponding powers of 29.3 and 13.9%, respectively. ESPRESSO, Estimating Sample-size and Power in R by Exploring Simulated Study Outcomes (Burton ), has been developed to provide a way to incorporate key elements and thereby to allow for more realistic power analysis and sample size calculation for large-scale epidemiological studies. It is a software tool, written in the R programming language (Ihaka and Gentleman, 1996), providing a simulation-based approach to power and sample size calculations for stand-alone studies, analyses nested in cohort studies and consortium-based meta-analyses. ESPRESSO is aimed primarily at researchers involved in designing and setting up studies to investigate the genetic and/or environmental basis of complex traits. In particular, it enables those designing large cohorts and biobanks to better estimate the sample size required to achieve adequate power. 
ESPRESSO also allows reviewers and funding bodies to verify the statistical power calculations put forward by researchers in their grant applications, thereby helping to ensure that resources are not wasted on incorrectly powered studies. In this article, we provide a concise introduction to the new ESPRESSO, which is available both as a set of R packages and as a web-based tool. This new version was used in a recent analysis by Gaye , and in this article we illustrate its use by answering a question put to us by the Canadian Partnership for Tomorrow (CPT) cohort.
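The half-sized-study calculation described in the Introduction can be checked numerically. The following is a minimal Python sketch rather than ESPRESSO code: the function name `power_at_fraction` and its arguments are illustrative, and it assumes only that the standard error scales with the inverse square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist


def power_at_fraction(alpha, target_power=0.80, fraction=0.5):
    """Actual power when only `fraction` of the required sample size is recruited."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = std_normal.inv_cdf(target_power)   # deviate for the intended power
    # The standard error scales with 1/sqrt(n), so the expected z-statistic
    # shrinks by sqrt(fraction) when the study is smaller than required.
    quantile = (z_alpha + z_power) * sqrt(fraction) - z_alpha
    return std_normal.cdf(quantile)


# Thresholds quoted in the text: 0.01, candidate gene (1e-4) and GWAS (1e-8).
powers = {alpha: power_at_fraction(alpha) for alpha in (0.01, 1e-4, 1e-8)}
for alpha, power in powers.items():
    print(f"alpha = {alpha:g}: actual power = {power:.3f}")
```

The three values agree with the powers quoted above to within rounding of the normal deviates.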

2 The new ESPRESSO and its implementation

This section reports concisely the work undertaken to produce the new ESPRESSO version from the initial R script by Burton et al. The new version was built with the aim to (i) provide a more comprehensive and user friendly R tool accessible to a wider audience; (ii) allow for analyses with quantitative outcomes and quantitative environmental exposures and (iii) extend the range of biomedical scenarios that can be investigated, particularly enabling more interaction models.

2.1 From the precursor R script to fully fledged R libraries

The original version of ESPRESSO (Burton ) consisted of a single R script which allowed for the fitting of a model with a binary outcome and two covariates [one single nucleotide polymorphism (SNP) and one binary environmental exposure] by calling a text file that held the input parameters for the calculations. This programming paradigm was adequate for the initial project but a more flexible solution was required for the analyses of the wider range of biomedical scenarios sought under the new version of the tool. Hence, it was necessary to write functions to cope with the increased complexity. Whilst the initial R script allowed for the investigation of five biomedical scenarios, the newly built R packages (libraries) enable power and sample size calculations for a total of 14 scenarios, reported in Table 1. Six R packages, each with about a dozen functions, were developed. The open source code of the packages, freely available from GitHub (https://github.com/ESPRESSO-research), allows users proficient in the R programming language to download the tool and use it as is, or modify the code in ways that best suit their needs.

Table 1.

Overview of the models/scenarios that could be investigated with the initial ESPRESSO script (GA, GB, EB, EB × GA and EB × GB) versus those that are enabled under the new R libraries (all other nine models)

	Additive genetic variant (GA)	Binary genetic variant (GB)	Quantitative environmental exposure (EQ)	Binary environmental exposure (EB)
Additive genetic variant (GA)	GA × GA
Binary genetic variant (GB)	GB × GA	GB × GB
Quantitative environmental exposure (EQ)	EQ × GA	EQ × GB	EQ × EQ
Binary environmental exposure (EB)	EB × GA	EB × GB	EB × EQ	EB × EB

The main effect scenarios that can be investigated are in the cells of the first column whilst the interaction scenarios are in the inner cells of the table.

2.2 The new parameters

The initial parameters of ESPRESSO have already been detailed by Burton . Therefore, this section focuses on five new key parameters that were required in the new version along with some of the considerations for choosing how these parameters might be set. In the new version of ESPRESSO, the outcome variable and the environmental determinants can be modelled as quantitative. Hence, the parameters ‘pheno.reliability’ and ‘env.reliability’ were introduced to set the level of uncertainty on the outcome and covariate measurements. These represent the reliability (test–retest reliability) of the assessment of a quantitative variable, which is a characteristic that reflects the consistency of the observed measurement across several repeats. The new version of the tool allows users to model two SNPs (rather than one in the initial script) as being in linkage disequilibrium (LD). To enable LD between the two SNPs, two new parameters (‘LD’ and ‘targetLD’) were added to the list of input parameters. The parameter ‘LD’ is a binary indicator: set to 1 to introduce LD between the two SNPs, and to 0 to generate two independent SNPs. The correlated SNPs are generated using a multivariate normal distribution function and the method developed in the R package HapSim (Montana, 2005). HapSim models a haplotype as a multivariate random variable with known marginal distributions and pairwise correlation coefficients. The package allows for the simulation of a SNP haplotype of several biallelic loci. In our implementation of the method for ESPRESSO, we limited the number of loci to two because we generate only two SNPs in LD. Our implementation of the method consists of two main steps: (i) we compute the covariance matrix required to generate two correlated binary vectors of length n (each vector represents one SNP and n is the number of observations) and (ii) we use the covariance computed in step (i) to generate a matrix of data that follow a multivariate normal distribution. 
For two loci, there are four possible haplotypes; the frequencies of the four possible haplotypes across the n simulated individuals sum to 1, and the Lewontin’s D and Pearson’s r correlation values calculated from the frequencies of the four haplotypes are equal to the target level of correlation (desired level of LD) specified in the first step. If the number of individuals to simulate is large, the programme runs more slowly. Because this setting (i.e. LD between the two SNPs) can be time consuming, it is preferable to set the sample size and the number of simulations to low values for an initial exploratory analysis. The parameter ‘targetLD’ represents the desired level of LD between the two SNPs, if they are to be modelled as being in LD. The user should consider the minor allele frequencies (MAFs) of the two SNPs when setting the desired level of LD: the MAFs of the two SNPs should not be markedly different, simply because a perfect correlation (i.e. an absolute value of 1) cannot be obtained if the SNPs have markedly different MAFs, as demonstrated mathematically in Section 1.1 of the Supplementary Material.
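The constraint that MAFs place on attainable LD, and the idea of thresholding a multivariate normal to obtain correlated binary alleles, can be illustrated with a short Python sketch. This is not the HapSim or ESPRESSO implementation: all function names are hypothetical, and the sketch takes a latent normal correlation as input and reports the allelic correlation it produces, whereas the method described above works in the other direction (it derives the covariance needed to hit the target). The bound in `max_correlation` is the standard upper limit on Pearson's r for two binary variables with fixed marginals.

```python
import random
from math import sqrt
from statistics import NormalDist


def max_correlation(p1, p2):
    """Largest Pearson r attainable between two binary alleles with frequencies p1 and p2."""
    lo, hi = sorted((p1, p2))
    return sqrt(lo * (1 - hi) / (hi * (1 - lo)))


def simulate_haplotypes(n, p1, p2, latent_rho, seed=1):
    """Threshold a bivariate normal to obtain n two-locus haplotypes (1 = minor allele)."""
    rng = random.Random(seed)
    nd = NormalDist()
    t1, t2 = nd.inv_cdf(p1), nd.inv_cdf(p2)  # thresholds giving the marginal MAFs
    haplotypes = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = latent_rho * z1 + sqrt(1 - latent_rho ** 2) * rng.gauss(0, 1)
        haplotypes.append((int(z1 < t1), int(z2 < t2)))
    return haplotypes


def pearson_r(haplotypes):
    """Pearson r between the two loci, computed from haplotype frequencies."""
    n = len(haplotypes)
    f1 = sum(a for a, _ in haplotypes) / n
    f2 = sum(b for _, b in haplotypes) / n
    f11 = sum(a * b for a, b in haplotypes) / n
    d = f11 - f1 * f2  # raw LD coefficient D
    return d / sqrt(f1 * (1 - f1) * f2 * (1 - f2))


haplotypes = simulate_haplotypes(n=20000, p1=0.3, p2=0.3, latent_rho=0.8)
achieved_r = pearson_r(haplotypes)
print(f"achieved r with equal MAFs: {achieved_r:.3f}")
print(f"upper bound on r for MAFs 0.1 and 0.3: {max_correlation(0.1, 0.3):.3f}")
```

With equal MAFs the bound is 1, whereas for MAFs of 0.1 and 0.3 it is roughly 0.5, which is why markedly different MAFs rule out a high target LD.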

2.3 The web-based version of ESPRESSO

The development of R packages greatly improved the usability of the tool and its maintenance (error tracking and debugging), but its use remained mainly confined to the R community. In order to widen the use of ESPRESSO, it was important to make it accessible to non-R users. We hence built a website based on the Joomla content management system (Joomla, 2014) and embedded an online version of the tool (http://www.espresso-research.org/). The website contains extensive documentation about the tool, and help information is available, upon a click, for each item on the graphical user interface of the calculator. The version of the ESPRESSO software running in the background of the page can be updated by simply installing the latest packages on the server where the page is hosted.

2.4 Overview of the ESPRESSO algorithm

An ESPRESSO simulation essentially comprises five steps as summarized graphically in Figure 1.

Fig. 1.

Flowchart that shows the main steps in an ESPRESSO process

First (step 1 in Fig. 1), a series of input values required to set the simulation (e.g. number of runs), outcome, genetic and/or environmental determinant parameters are specified. Then (step 2 in Fig. 1), an error-free dataset which contains the true outcome and determinant values for each simulated individual is generated. The word ‘true’ here refers not to the true value of some real individual in the real world, but rather the true values (without error) of each simulated variable in each simulated subject within ESPRESSO. If the outcome is binary, the ‘true’ outcome may be perturbed by heterogeneity in the baseline risk of disease arising from the impact of unmeasured determinants that do not themselves appear in the model. At the next stage of the process (step 3 in Fig. 1) an error is generated and added to the true data to produce the ‘observed’ data. This error, in effect, disturbs (or may disturb) the observed values of the outcome variable and each of the covariates. The structure and magnitude of the error depend on the input parameters, for example reflecting the presumed sensitivity and specificity of the assessments of binary variables or the reliability of quantitative measures. Then (step 4 in Fig. 1), the observed data generated in the simulation stage are analysed by generalized linear modelling (GLM). Steps 2, 3 and 4 are repeated a number of times equal to the number of runs specified at step 1. After each run, as shown graphically in Figure 2, a matrix, D, of observed data is generated and analysed by GLM, and the estimates (beta, standard error and z-score) obtained from the GLM fit are stored in three distinct vectors.

Fig. 2.

Graphical view of the GLM analysis in ESPRESSO. After each simulation run, a dataset of observed values is generated and analysed, and the beta coefficient, standard error and z-statistic are stored

Finally (step 5 in Fig. 1), the sample size required to achieve the desired power specified at step 1 and the empirical and theoretical (modelled) powers achievable with the input sample size, also specified at step 1, are calculated. The sample size required is the product of the input sample size and the relative change in standard error needed to achieve the desired power. The empirical power is the proportion of runs in which the z-statistic for the parameter of interest exceeds the z-statistic for the desired level of statistical significance. The theoretical power is the probability that the z-statistic obtained from the GLM fit takes any value less than or equal to the ratio of the mean beta coefficient to the mean standard error, i.e. it is the cumulative distribution function associated with the z-statistic obtained from the GLM fit. The empirical power is not informative for extreme values of the standard error of the log odds ratio; in such cases one should consider the theoretical power.
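The five steps can be sketched for the simplest case of a quantitative outcome and one additive SNP. This is an illustrative Python sketch under stated assumptions, not the ESPRESSO code: all names and default values are hypothetical, ordinary least squares stands in for the GLM fit, measurement error is added to the outcome only, the modelled power is taken as the normal probability of the mean z-statistic exceeding the critical value, and the required sample size scales the input size by the squared ratio of deviates (because the standard error varies with the inverse square root of n).

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_run(rng, n, maf, beta, sd_outcome, reliability):
    """One run: true data -> observed data -> linear fit. Returns (beta_hat, se, z)."""
    # Step 2: error-free data (additive SNP coded 0/1/2, quantitative outcome).
    x = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    y_true = [beta * xi + rng.gauss(0, sd_outcome) for xi in x]
    # Step 3: add error so that var(true) / var(observed) is roughly `reliability`.
    sd_error = sd_outcome * sqrt((1 - reliability) / reliability)
    y_obs = [yi + rng.gauss(0, sd_error) for yi in y_true]
    # Step 4: ordinary least squares (a GLM with identity link) for y_obs ~ x.
    mx, my = sum(x) / n, sum(y_obs) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y_obs))
    beta_hat = sxy / sxx
    intercept = my - beta_hat * mx
    rss = sum((yi - intercept - beta_hat * xi) ** 2 for xi, yi in zip(x, y_obs))
    se = sqrt(rss / (n - 2) / sxx)
    return beta_hat, se, beta_hat / se


def power_analysis(n=1000, runs=200, maf=0.3, beta=0.2, sd_outcome=1.0,
                   reliability=0.7, alpha=0.01, target_power=0.80, seed=42):
    # Step 1: the arguments above play the role of the input parameters.
    rng = random.Random(seed)
    results = [simulate_run(rng, n, maf, beta, sd_outcome, reliability)
               for _ in range(runs)]
    betas, ses, zs = zip(*results)  # the three vectors of stored estimates
    # Step 5: summarise across runs.
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_target = nd.inv_cdf(target_power)
    mean_z = mean(betas) / mean(ses)
    return {
        "empirical_power": sum(1 for z in zs if z > z_alpha) / runs,
        "modelled_power": nd.cdf(mean_z - z_alpha),
        "n_required": n * ((z_alpha + z_target) / mean_z) ** 2,
    }


result = power_analysis()
print(result)
```

Keeping the number of runs and the sample size small, as here, makes exploratory checks quick; the empirical power then carries noticeable Monte Carlo error, which is one reason to report the modelled power alongside it.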

3 Case study: analysis of the power of CPT cohort project to study quantitative traits

3.1 The CPT project

The CPT is a pan-Canadian initiative funded by the Canadian Partnership Against Cancer (CPAC). It aims to create a national biobank/bio-repository to provide a platform for future research on common chronic diseases including cancer and cardiovascular disease (Borugian ). CPT is based on the integration of five large provincial cohorts, each recruiting several tens of thousands of middle-aged participants. The planned (target), current and projected final recruitment numbers for the CPT project (at the time of this analysis) are outlined in Table 2 (Borugian ). The research team was aiming for a final sample size somewhere in the range 110 000–180 000 but was uncertain what inferential benefits would accrue from being towards the top of that range rather than towards the bottom.

Table 2.

Configuration of the CPT cohort: sample size at the time of this analysis and target sample size

Name	Age-range at recruitment	Target sample size	Sample size by the time of this analysis
Atlantic cohort	40–69 years	30 000	15 000–25 000
British Columbia cohort	40–69 years	40 000	15 000–25 000
CARTaGENE (Quebec)	35–69 years	20 000	20 000
Ontario health survey (Ontario)	35–69 years	150 000	35 000–70 000
The tomorrow project (Alberta)	40–69 years	50 000	25 000–40 000
CPT project overall	Predominantly 35–69 years	250 000	110 000–180 000


3.2 Aim of the analysis and rationale for using ESPRESSO

The aim of this analysis was to assess the statistical power of CPT as a platform for research projects exploring quantitative traits as outcomes, given its ultimate sample size. The analysis was requested by CPT to inform the primary (and immediate) strategic decisions to be made by CPAC on whether to continue recruitment at a rate that was likely to produce a total of approximately 110 000 participants by the end of March 2012, or alternatively to prioritise and step up recruitment with the aim of recruiting as many as 180 000. The ESPRESSO platform was used to carry out the calculations because, unlike standard approaches, it takes account of uncertainties around outcome and covariate measurements, as mentioned in the introductory section of the article.

3.3 The outcome variables investigated

The exploration of the power profiles of CPT for quantitative outcomes was based on the estimates of participant-to-participant variation (standard deviation) for an extensive range of critical disease-related traits. These traits were originally drawn up for the power calculations of the CARTaGENE project (CARTaGENE, 2008) carried out under the direction of P. R. Burton. Forty-three outcome variables were investigated; they are either physical measures or biochemical and haematological parameters. An analysis of biochemical and haematological parameters on fresh blood is useful because: (i) it provides a series of valuable quantitative traits which are meaningful in their own right as complex traits that are worthy of aetiological study; (ii) it provides a series of quantitative traits that reflect intermediate traits that lie on the causal pathways leading to a number of complex binary traits that are of scientific interest; (iii) it includes a number of ‘health screening’ parameters that are of interest to potential recruits and therefore provide a tangible ‘return’ for agreeing to participate. The scientific rationales that justified the inclusion of each of the 43 variables in this study are the same as those that justified their inclusion in the CARTaGENE project (CARTaGENE, 2008). The rationale and the assumed distribution (mean and standard deviation) of each of the variables are available from earlier work (CARTaGENE, 2008; Gaye, 2012). This section serves mainly as an illustration of one of the possible uses of ESPRESSO, so the results are restricted to two outcome variables [systolic blood pressure (SBP) and a generic standardized variable]. The analyses of the other variables were conducted using the same strategy, with full results reported elsewhere (Gaye, 2012).

3.4 Methods

3.4.1 The biomedical scenarios investigated and the strategy

The power profiles of CPT were analysed for the six scenarios summarized in Table 3. Because the fitted model is linear (i.e. continuous outcome), it is easy to mathematically calculate the minimum estimated effects from the first effect obtained empirically; however, we deliberately chose to obtain all the minimum estimated effects empirically. For each outcome and under each scenario, an iterative approach was used in ESPRESSO to determine the minimum estimated effect that can be detected with an empirical power of 80 ± 2% using the probable final sample sizes of CPT (110 000 and 180 000). The iterative approach consisted of looping through a range of effect sizes until reaching the smallest effect that ensures a power of 80%; these minimum effects are referred to as minimum detectable effect sizes (MDESs).
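The iterative search for an MDES can be sketched as a loop over increasing effect sizes. In this Python illustration the names are hypothetical, and a closed-form normal approximation (`approx_power`) stands in for the simulation-based empirical power used in the analysis. It ignores measurement error, so it is not expected to reproduce the values reported in Table 4.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, n, sd_observed, var_exposure, alpha):
    """Normal-approximation power for one linear-model coefficient (stand-in for a simulation)."""
    nd = NormalDist()
    se = sd_observed / sqrt(n * var_exposure)
    return nd.cdf(abs(effect) / se - nd.inv_cdf(1 - alpha / 2))


def minimum_detectable_effect(power_fn, start, step, target=0.80, max_iter=100000):
    """Loop through increasing effect sizes; return the first one reaching the target power."""
    effect = start
    for _ in range(max_iter):
        if power_fn(effect) >= target:
            return effect
        effect += step
    raise ValueError("target power not reached within max_iter steps")


# Illustrative setting: standardized outcome (SD = 1), binary exposure with
# prevalence 0.2 (variance 0.2 * 0.8), 110 000 participants, P-value < 0.01.
def power_fn(effect):
    return approx_power(effect, n=110_000, sd_observed=1.0,
                        var_exposure=0.2 * 0.8, alpha=0.01)


mdes = minimum_detectable_effect(power_fn, start=0.0, step=0.0005)
power_at_mdes = power_fn(mdes)
print(f"MDES = {mdes:.4f} SD units, power = {power_at_mdes:.3f}")
```

Replacing `power_fn` with a function that runs a batch of simulations and returns the empirical power gives the empirical variant of the search; the step size then determines how closely the 80 ± 2% window is hit.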

Table 3.

The six scenarios explored to construct each power profile

Scenario	Minor allele frequency	Prevalence of ‘at risk’ environmental exposure	Mathematical model
1. Common determinants	0.30	0.50	Main effects only
2. Moderately common determinants	0.10	0.20	Main effects only
3. Uncommon determinants	0.05	0.10	Main effects only
4. Common determinants	0.30	0.50	Main effects + interaction
5. Moderately common determinants	0.10	0.20	Main effects + interaction
6. Uncommon determinants	0.05	0.10	Main effects + interaction

For each of the six scenarios in Table 3, the outcome, exposures and simulation parameters are described under Section 2 of the Supplementary Material and the values these parameters were set at are available under Section 4 of the Supplementary Material.

3.4.2 Analytic assumptions about the outcome and the genetic and environmental determinants

For each of the scenarios 1, 2 and 3 in Table 3, the GLM fitted in ESPRESSO consists of one outcome (the quantitative trait being analysed) and one covariate; the three scenarios were analysed twice, once with a SNP as covariate and once with an environmental factor as covariate. The genetic determinants were modelled as SNPs using an additive genetic model, as is now most commonly used (Wellcome Trust Case Control ): thus, the covariate was effectively modelled linearly as an ordinal variable taking a value 0, 1 or 2 indicating the number of minor alleles carried by each simulated subject. The genotyping error is generated as follows: we consider an observed marker that is not in complete LD with an unobserved causal variant (r² < 1); hence, the observed marker does not carry all of the information held by the unobserved causal variant. When the observed marker is typed, it is as if the unobserved causal variant has been typed with some error whose magnitude increases with decreasing LD. It is this error that we consider as the genotyping error. In other words, the genotyping error was taken as being equivalent to the error that arises when the genotype at a locus of interest is inferred from the genotype of an observed marker (with the same allelic distribution) that is in LD with the unobserved causal variant at the same locus of interest with an r² value of 0.8. This corresponds to the weakest LD with HapMap 2 markers on the Affymetrix 500 K chip (Barrett and Cardon, 2006). The environmental determinants were modelled as binary, and the measurement error was introduced by assuming an underlying latent variable with a reliability of 0.7. This reflects a moderate level of measurement error corresponding, for example, to blood pressure measurement in the Intersalt Study (Dyer ). Gene-environment interactions were modelled using product terms, again assuming an additive genetic model. 
Significance tests for genetic main effects and interactions were based on P-value < 0.0001 (i.e. assuming vague candidate genes) or P-value < 10⁻⁷ (genome wide association studies); these P-values are considered conservative enough for genetic determinants (Pearson and Manolio, 2008; Pharoah ; Risch and Merikangas, 1996; Storey and Tibshirani, 2003). Non-genetic effects were tested at P-value < 0.01, a threshold more stringent than 0.05, the value widely used in association studies (Burton ; Ioannidis, 2005). Unless otherwise specified, power estimation was based on the standard deviation and on the measurement reliability of the trait being considered as obtained from the analysis of the CARTaGENE optimization phase. When no firm evidence was available to determine the likely measurement reliability of the quantitative trait being considered, it was taken to be 0.7.

3.5 Results

It is important to understand how the MDES should be interpreted. To illustrate the interpretation, consider Table 4, which provides an extract of the power profile of CPT to investigate SBP as a quantitative outcome measured conventionally in a clinic setting (the results for all the scenarios in Table 3 are reported under Section 3 of the Supplementary Material). Conventional (peripheral) blood pressure is measured as the mean of three measurements. The device chosen (Colin Prodigy II Vital Signs Monitor OM-2200) is an automated device that uses the oscillometric method for assessing blood pressure.

Table 4.

Minimal detectable effect sizes for SBP with 110 000 participants

	Genetic main effect	Environment main effect
P-value	10⁻⁷ (GWAS)	0.01
Moderately common determinants	1.0433	0.8123

The classification of the determinants as moderately common refers to the MAF of the genetic determinant (0.1) and the prevalence of the environmental exposure (0.2), respectively, as reported in Table 3.

The population distribution of the variable reported in Table 4 (SBP) is: mean = 126 mmHg and SD = 18.2 mmHg. In the body of the table, the MDES for the environmental main effect for the moderately common exposure was reported as 0.8123 mmHg. This scenario (Table 3) invokes a binary environmental exposure with a prevalence of 0.2 (20%). The reported results therefore imply that if the final sample size of CPT was 110 000 participants, if conventional clinic blood pressure was measured using the standard operating protocol (SOP) outlined in the first paragraph above, and if scientific interest focused on the impact of a binary environmental exposure which had realistic characteristics corresponding to those outlined in Section 3.4.2, the power calculations would indicate that there was an 80% chance of detecting, at P-value < 0.01, a real effect of that environmental determinant corresponding to an increase (or decrease) in SBP of 0.8123 mmHg, on average. Similarly, if interest were focused on a moderately common SNP in a genome wide association study (GWAS), there would be an 80% chance of detecting the effect of a SNP with a minor allele frequency of 0.10 (10%) at P-value < 10⁻⁷ (for genome-wide inference) if that SNP really increased or decreased SBP by at least 1.0433 mmHg.
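To relate these MDESs to the spread of SBP in the population, they can be expressed as fractions of the stated standard deviation (18.2). The short Python snippet below is simple arithmetic on the Table 4 values; the variable names are illustrative.

```python
sd_sbp = 18.2  # population SD of SBP quoted in the text

# MDESs from Table 4 (110 000 participants, moderately common determinants).
mdes_sbp = {
    "genetic main effect (P < 1e-7)": 1.0433,
    "environment main effect (P < 0.01)": 0.8123,
}

mdes_in_sd_units = {label: effect / sd_sbp for label, effect in mdes_sbp.items()}
for label, fraction in mdes_in_sd_units.items():
    print(f"{label}: {fraction:.3f} SD")
```

Both values are a small fraction of one standard deviation, which is what makes them detectable only at very large sample sizes.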

3.6 Conclusions

Given the magnitudes of the effect sizes that could be detected using the entire data of the CPT project, the cohort has good potential to study the aetiological architecture of quantitative traits. Scrutiny of the power profiles of the quantitative variables tabulated individually demonstrates that, given a sample size of 110 000 or 180 000, genetic and environmental main effects associated with any quantitative variable collected across the whole CPT project can potentially be studied with substantial power: effect sizes as small as 1/12th of a standard deviation will be reliably detectable even under the most challenging scenario (uncommon genotype, i.e. MAF = 0.05, with testing at P-value < 10⁻⁷ under GWAS) (Bansal ; Bodmer and Bonilla, 2008; Burton ). But, as would be anticipated (Burton ; Wong ), the power to detect gene-environment interactions is considerably lower. Given the central relevance of such interactions, it is important to note that a sample size of 180 000 rather than 110 000 would markedly enhance the capacity to study gene-environment interactions when the interacting determinants are both other than common. The larger sample size will also allow additional scope for data sub-setting.

4 Discussion

The ESPRESSO software allows elements that are not taken into account in conventional power calculators to be included more readily in the power and sample size calculations for stand-alone case-control and cohort studies as well as for case-control analyses nested in cohort studies. The new version of ESPRESSO is implemented as open source R packages to allow researchers proficient in the R programming language to use it in a flexible way and to give them the ability to access and alter the code to answer further scientific questions that require some modification to the downloadable version. This new version also comes with an online, menu-driven interface for non-R users or for those who simply prefer a graphical user interface. With the current version of ESPRESSO, it is not possible to carry out power calculations for genetic association studies that precisely represent reality, i.e. when an analysis might choose to model a substantive number of genetic variants at the same time. This is because the current version of the tool allows for the modelling of no more than two genetic exposures and two environment/lifestyle exposures at one time. However, given Mendelian randomization, which ensures that SNPs must be closely located to be correlated because of LD, and given the large number of participants in most studies in which ESPRESSO might be used, which means that the residual error structure will be affected very little by the degrees of freedom used up by including even tens of genetic covariates, we feel this is not a serious problem. There would only be a problem if the inferences based on using just two SNPs in isolation were to differ substantially from those derived if all SNPs were to be considered together, and in many settings there is no substantial difference at all. 
For example, if a GWAS analysis deals sequentially with one million separate variant-disease associations, it is perfectly acceptable to model one of those associations in isolation.

4.1 Limitations and future work

The version of ESPRESSO presented here does not allow for power and sample size calculation where the genetic component is a complex haplotype (i.e. several loci in LD). With the current version, at most two loci in LD can be modelled, and this is not a very realistic representation of haplotype blocks. So the current ESPRESSO approach is potentially restrictive in this regard. It is therefore desirable to extend the software by implementing additional methods that do enable the joint consideration of a larger number of SNPs in LD. This could potentially be done by adapting the method developed by Montana in the R package HapSim (Montana, 2005), which we mentioned in Section 2.2. In the current version of the tool, the process consists of simulating a dataset with a number of specific characteristics and seeing in what proportion of the simulations the effect of interest is detected. As part of further work, we are considering exploiting this feature to build in a function that allows for the estimation of the false discovery rate (FDR) by first determining the proportion of false discoveries among all the discoveries and then calculating the FDR as defined by Benjamini and Hochberg (Benjamini and Hochberg, 1995).
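As background to the FDR idea, the Python sketch below shows the Benjamini and Hochberg step-up procedure together with the proportion of false discoveries among the discoveries, which can be computed in a simulation because the truth of each hypothesis is known. It is a generic illustration and not the planned ESPRESSO function; the function names, P-values and truth labels are made up for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank k such that p_(k) <= k * q / m.
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    return sorted(order[:largest_rank])


def false_discovery_proportion(rejected, is_null):
    """Share of rejected hypotheses that are truly null (0 when nothing is rejected)."""
    if not rejected:
        return 0.0
    return sum(1 for i in rejected if is_null[i]) / len(rejected)


# Made-up P-values; in a simulation the null status of each test is known.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
is_null = [False, True, False, True, True, True, True, True, True, True]

rejected = benjamini_hochberg(p_values, q=0.05)
fdp = false_discovery_proportion(rejected, is_null)
print(rejected, fdp)
```

Averaging the false discovery proportion over many simulated datasets gives an estimate of the FDR for the chosen scenario.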

15 in total

1. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

2. The detection of gene-environment interaction for continuous traits: should we deal with measurement error by bigger studies or better measurement?

Authors: M Y Wong; N E Day; J A Luan; K P Chan; N J Wareham
Journal: Int J Epidemiol Date: 2003-02 Impact factor: 7.196

3. Evaluating coverage of genome-wide association studies.

Authors: Jeffrey C Barrett; Lon R Cardon
Journal: Nat Genet Date: 2006-05-21 Impact factor: 38.330

4. The future of genetic studies of complex human diseases.

Authors: N Risch; K Merikangas
Journal: Science Date: 1996-09-13 Impact factor: 47.728

5. Urinary electrolyte excretion in 24 hours and blood pressure in the INTERSALT Study. II. Estimates of electrolyte-blood pressure associations corrected for regression dilution bias. The INTERSALT Cooperative Research Group.

Authors: A R Dyer; P Elliott; M Shipley
Journal: Am J Epidemiol Date: 1994-05-01 Impact factor: 4.897

Review 6. Statistical analysis strategies for association studies involving rare variants.

Authors: Vikas Bansal; Ondrej Libiger; Ali Torkamani; Nicholas J Schork
Journal: Nat Rev Genet Date: 2010-10-13 Impact factor: 53.242

7. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.

Authors: Nick Craddock; Matthew E Hurles; Niall Cardin; Richard D Pearson; Vincent Plagnol; Samuel Robson; Damjan Vukcevic; Chris Barnes; Donald F Conrad; Eleni Giannoulatou; Chris Holmes; Jonathan L Marchini; Kathy Stirrups; Martin D Tobin; Louise V Wain; Chris Yau; Jan Aerts; Tariq Ahmad; T Daniel Andrews; Hazel Arbury; Anthony Attwood; Adam Auton; Stephen G Ball; Anthony J Balmforth; Jeffrey C Barrett; Inês Barroso; Anne Barton; Amanda J Bennett; Sanjeev Bhaskar; Katarzyna Blaszczyk; John Bowes; Oliver J Brand; Peter S Braund; Francesca Bredin; Gerome Breen; Morris J Brown; Ian N Bruce; Jaswinder Bull; Oliver S Burren; John Burton; Jake Byrnes; Sian Caesar; Chris M Clee; Alison J Coffey; John M C Connell; Jason D Cooper; Anna F Dominiczak; Kate Downes; Hazel E Drummond; Darshna Dudakia; Andrew Dunham; Bernadette Ebbs; Diana Eccles; Sarah Edkins; Cathryn Edwards; Anna Elliot; Paul Emery; David M Evans; Gareth Evans; Steve Eyre; Anne Farmer; I Nicol Ferrier; Lars Feuk; Tomas Fitzgerald; Edward Flynn; Alistair Forbes; Liz Forty; Jayne A Franklyn; Rachel M Freathy; Polly Gibbs; Paul Gilbert; Omer Gokumen; Katherine Gordon-Smith; Emma Gray; Elaine Green; Chris J Groves; Detelina Grozeva; Rhian Gwilliam; Anita Hall; Naomi Hammond; Matt Hardy; Pile Harrison; Neelam Hassanali; Husam Hebaishi; Sarah Hines; Anne Hinks; Graham A Hitman; Lynne Hocking; Eleanor Howard; Philip Howard; Joanna M M Howson; Debbie Hughes; Sarah Hunt; John D Isaacs; Mahim Jain; Derek P Jewell; Toby Johnson; Jennifer D Jolley; Ian R Jones; Lisa A Jones; George Kirov; Cordelia F Langford; Hana Lango-Allen; G Mark Lathrop; James Lee; Kate L Lee; Charlie Lees; Kevin Lewis; Cecilia M Lindgren; Meeta Maisuria-Armer; Julian Maller; John Mansfield; Paul Martin; Dunecan C O Massey; Wendy L McArdle; Peter McGuffin; Kirsten E McLay; Alex Mentzer; Michael L Mimmack; Ann E Morgan; Andrew P Morris; Craig Mowat; Simon Myers; William Newman; Elaine R Nimmo; Michael C O'Donovan; Abiodun Onipinla; Ifejinelo Onyiah; 
Nigel R Ovington; Michael J Owen; Kimmo Palin; Kirstie Parnell; David Pernet; John R B Perry; Anne Phillips; Dalila Pinto; Natalie J Prescott; Inga Prokopenko; Michael A Quail; Suzanne Rafelt; Nigel W Rayner; Richard Redon; David M Reid; Susan M Ring; Neil Robertson; Ellie Russell; David St Clair; Jennifer G Sambrook; Jeremy D Sanderson; Helen Schuilenburg; Carol E Scott; Richard Scott; Sheila Seal; Sue Shaw-Hawkins; Beverley M Shields; Matthew J Simmonds; Debbie J Smyth; Elilan Somaskantharajah; Katarina Spanova; Sophia Steer; Jonathan Stephens; Helen E Stevens; Millicent A Stone; Zhan Su; Deborah P M Symmons; John R Thompson; Wendy Thomson; Mary E Travers; Clare Turnbull; Armand Valsesia; Mark Walker; Neil M Walker; Chris Wallace; Margaret Warren-Perry; Nicholas A Watkins; John Webster; Michael N Weedon; Anthony G Wilson; Matthew Woodburn; B Paul Wordsworth; Allan H Young; Eleftheria Zeggini; Nigel P Carter; Timothy M Frayling; Charles Lee; Gil McVean; Patricia B Munroe; Aarno Palotie; Stephen J Sawcer; Stephen W Scherer; David P Strachan; Chris Tyler-Smith; Matthew A Brown; Paul R Burton; Mark J Caulfield; Alastair Compston; Martin Farrall; Stephen C L Gough; Alistair S Hall; Andrew T Hattersley; Adrian V S Hill; Christopher G Mathew; Marcus Pembrey; Jack Satsangi; Michael R Stratton; Jane Worthington; Panos Deloukas; Audrey Duncanson; Dominic P Kwiatkowski; Mark I McCarthy; Willem Ouwehand; Miles Parkes; Nazneen Rahman; John A Todd; Nilesh J Samani; Peter Donnelly
Journal: Nature Date: 2010-04-01 Impact factor: 49.962

Review 8. Common and rare variants in multifactorial susceptibility to common diseases.

Authors: Walter Bodmer; Carolina Bonilla
Journal: Nat Genet Date: 2008-06 Impact factor: 38.330

9. Size matters: just how big is BIG?: Quantifying realistic sample size requirements for human genome epidemiology.

Authors: Paul R Burton; Anna L Hansell; Isabel Fortier; Teri A Manolio; Muin J Khoury; Julian Little; Paul Elliott
Journal: Int J Epidemiol Date: 2008-08-01 Impact factor: 7.196

10. Understanding the impact of pre-analytic variation in haematological and clinical chemistry analytes on the power of association studies.

Authors: Amadou Gaye; Tim Peakman; Martin D Tobin; Paul R Burton
Journal: Int J Epidemiol Date: 2014-08-01 Impact factor: 7.196
