DIST: direct imputation of summary statistics for unmeasured SNPs.

Donghyung Lee, T Bernard Bigdeli, Brien P Riley, Ayman H Fanous, Silviu-Alin Bacanu.

Abstract

MOTIVATION: Genotype imputation methods are used to enhance the resolution of genome-wide association studies, and thus increase the detection rate for genetic signals. Although most studies report all univariate summary statistics, many of them limit access to subject-level genotypes. Because such access is required by all genotype imputation methods, it is helpful to develop methods that impute summary statistics without going through the interim step of imputing genotypes. Even when subject-level genotypes are available, the substantial computational cost of typical genotype imputation creates a need for faster imputation methods.
RESULTS: Direct Imputation of summary STatistics (DIST) imputes the summary statistics of untyped variants without first imputing their subject-level genotypes. This is achieved by (i) using the conditional expectation formula for multivariate normal variates and (ii) using the correlation structure from a relevant reference population. When compared with genotype imputation methods, DIST (i) requires only a fraction of their computational resources, (ii) has comparable imputation accuracy for independent subjects and (iii) is readily applicable to the imputation of association statistics coming from large pedigree data. Thus, the proposed application is useful for a fast imputation of summary results for (i) studies of unrelated subjects, which (a) do not provide subject-level genotypes or (b) have a large size and (ii) family association studies.
AVAILABILITY AND IMPLEMENTATION: Pre-compiled executables built under commonly used operating systems are publicly available at http://code.google.com/p/dist/.
CONTACT: dlee4@vcu.edu

Year:  2013        PMID: 23990413      PMCID: PMC3810851          DOI: 10.1093/bioinformatics/btt500

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Genome-wide association studies (GWASs) have been successful in detecting associations between genetic variants and complex diseases (Hindorff et al.). However, GWASs genotype only a fraction of the tens of millions of single nucleotide polymorphisms (SNPs) found in the human genome. To increase resolution, and thus the detection rate for genetic signals, researchers proposed imputing genotypes at numerous untyped (unmeasured) SNPs (de Bakker et al.; Li et al.; Marchini and Howie, 2010). Most commonly used genotype imputation tools, e.g. IMPUTE2 (Howie et al.), MACH (Li et al.), BIMBAM (Servin and Stephens, 2007) and BEAGLE (Browning and Browning, 2007), are based on Hidden Markov Models (HMMs). Although these methods are accurate, due to their need for haplotypic phasing of all subjects in the study, they are extremely burdensome computationally. Their computational burden would become even more extreme with the ever increasing size of studies and reference panels. Other genotype imputation methods [PLINK (Purcell et al.), SNPMSTAT (Lin et al.), UNPHASED (Dudbridge, 2008) and TUNA (Wen and Nicolae, 2008)] are based on multinomial models (MMs) of haplotype frequencies instead of HMMs. These methods are simpler and faster, but their imputation accuracy is generally lower than the accuracy of HMM-based methods (Marchini and Howie, 2008, 2010). Recently, researchers proposed a new MM-based imputation method, called BLIMP, which imputes genotypes/allele frequencies for unmeasured SNPs by using the conditional expectation formula for multivariate normal variates (Wen and Stephens, 2010). Regardless of their model usage, all these imputation tools require a two-stage procedure that (i) imputes subject-level genotypes at the unmeasured SNPs on the basis of genotypes at measured SNPs and a relevant reference population [e.g. 1000 Genomes (1KG) (Altshuler et al.)] and (ii) tests for association between imputed genotypes and phenotype of interest.
However, this procedure requires access to subject-level genotypes, which are often unavailable. To directly impute summary statistics while (i) substantially reducing the computational burden and (ii) retaining imputation accuracy, we propose Direct Imputation of summary STatistics (DIST). DIST avoids the imputation of subject-level genotypes by directly applying, to unmeasured SNP statistics, the classical conditional expectation formula for multivariate normal variates.
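For reference, the classical conditional expectation formula for multivariate normal variates can be written as below. The notation is ours, not the paper's: Z_m and Z_u denote the vectors of statistics at measured and unmeasured SNPs, and the Sigma blocks are the corresponding partitions of their joint correlation matrix. The zero mean assumes the statistics are standardized under the null hypothesis.

```latex
\[
\begin{pmatrix} Z_m \\ Z_u \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\Sigma_{mm} & \Sigma_{mu} \\
\Sigma_{um} & \Sigma_{uu}
\end{pmatrix}
\right)
\quad\Longrightarrow\quad
\begin{aligned}
\mathrm{E}[Z_u \mid Z_m] &= \Sigma_{um}\,\Sigma_{mm}^{-1}\,Z_m, \\
\mathrm{Var}[Z_u \mid Z_m] &= \Sigma_{uu} - \Sigma_{um}\,\Sigma_{mm}^{-1}\,\Sigma_{mu}.
\end{aligned}
\]
```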

2 SOFTWARE

DIST imputes the statistics at the unmeasured SNPs in a prediction window as a function of (i) the statistics at measured SNPs in a larger window, henceforth denoted as extended window, and (ii) the correlation matrix of both measured and unmeasured statistics, as estimated from a relevant reference dataset. Similar to BLIMP (Wen and Stephens, 2010), DIST uses the conditional mean formula for multivariate normal variates. Unlike BLIMP, DIST applies the formula (i) directly to summary statistics and (ii) under the null hypothesis, i.e. the hypothesis under which the distribution of all statistical tests is computed. This novel application allows the imputation of summary statistics without imputing subject-level genotypes (see Section 1.1 in Supplementary Material). To reduce computation time, DIST is implemented in C++, which ensures that it can be easily used under Linux, Windows, MacOS etc. DIST takes as input a file containing the (normally distributed) GWAS/meta-analysis summary statistics (see Section 1.5 of Supplementary Material for more information on obtaining the normally distributed statistics). It provides a command line interface with various options for specifying the number of measured SNPs to be contained in (i) the prediction window and (ii) each side region of the extended window, and so forth (Supplementary Table S1).
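The conditional mean step for a single window can be illustrated with a minimal Python sketch. This is not the DIST implementation (which is written in C++); the function names `solve_linear` and `impute_z` and the toy correlation values are hypothetical, chosen only for illustration. The sketch assumes Z-scores that are standard normal under the null and a correlation matrix already estimated from a reference panel.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's matrix is left untouched.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix of measured SNPs is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def impute_z(sigma_mm: Matrix, sigma_um: Matrix, z_m: Sequence[float]) -> List[float]:
    """Conditional mean E[Z_u | Z_m] = Sigma_um * Sigma_mm^{-1} * Z_m.

    sigma_mm: correlations among measured SNPs in the extended window.
    sigma_um: one row per unmeasured SNP, its correlations with the measured SNPs.
    z_m:      Z-scores observed at the measured SNPs.
    """
    weights = solve_linear(sigma_mm, z_m)  # Sigma_mm^{-1} * Z_m
    return [sum(r * w for r, w in zip(row, weights)) for row in sigma_um]


# Toy example (made-up numbers): two measured SNPs, one unmeasured SNP.
sigma_mm = [[1.0, 0.5],
            [0.5, 1.0]]
sigma_um = [[0.8, 0.4]]
z_measured = [3.0, 1.0]
z_imputed = impute_z(sigma_mm, sigma_um, z_measured)
print(z_imputed)  # approximately [2.4]
```

In practice this step would be repeated for each prediction window, with the measured SNPs taken from the surrounding extended window.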

3 RESULTS

We compare the performance of DIST with the performance of typical HMM and MM methods. Given the speed and accuracy of SHAPEIT phasing (Delaneau et al.), we chose IMPUTE2 as the representative for the HMM-based methods. Given its wide availability, we chose PLINK as the representative for the MM-based methods. Thus, we compare the performance of DIST, IMPUTE2 and PLINK (at default settings) to predict 99 imputed height SNPs in 25 realistic simulations under both the null and the alternative hypothesis (Fig. 1). The phenotypes (height) of 5000 subjects are simulated as a function of the effects at the 180 significant SNPs from the height meta-analysis (Lango et al.) (Section 2 in Supplementary Material). Imputations used Europeans in 1KG as the reference sample and were performed on a single Linux machine (Intel Xeon 2.67 GHz processor and 64 GB of RAM).
Fig. 1. Imputed Z-scores as a function of the true Z-scores by imputation method (strip), under the null (red) and the alternative (blue) hypothesis.

The average accuracy of imputed Z-scores in the above 50 simulations, as measured by the squared correlation coefficient (r²) between imputed and true Z-scores, is high for DIST (0.98) and IMPUTE2 (0.99) and moderately high for PLINK (0.92) (Fig. 1). The average running time per simulation was 76 min for DIST, 270 min for imputation plus 965 min for pre-phasing for IMPUTE2, and 3971 min for PLINK. The maximum memory requirement was 52 MB for DIST, ∼9500 MB for IMPUTE2 and 5300 MB for PLINK. [DIST and IMPUTE2 were also used to impute the statistics for 5% of SNPs missing at random on chromosome 22 of a dataset of 5000 subjects (which included the data described in the next paragraph); DIST required 33 min of running time and at most 283 MB of memory, and IMPUTE2 required 846 min for imputation plus 3437 min for pre-phasing and 9470 MB of memory.]
To impute untyped statistics, DIST requires only the joint correlation matrix for the statistics at typed and untyped SNPs. Because this matrix does not depend on the relationship between subjects in the study, DIST, unlike genotype imputation methods, can be readily used to impute statistics for family association studies. We illustrate this advantage by applying the method to a proprietary Irish alcohol dependence study sample consisting of 1755 controls and 710 cases from 431 Irish families. The subjects were genotyped using the Affymetrix 6.0 SNP array, and the association statistics were computed using MQLS (Thornton and McPeek, 2007). To impute unmeasured SNPs, we used UK10K (www.uk10k.org) as the reference panel (Supplementary Fig. S2).
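The accuracy measure reported for the simulations, the squared correlation coefficient between imputed and true Z-scores, can be computed as in the following minimal sketch. The function name `squared_correlation` and the example values are hypothetical and are not taken from the study data.

```python
from typing import Sequence


def squared_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Made-up Z-scores, for illustration only.
true_z = [0.0, 1.0, 2.0, 3.0]
imputed_z = [0.0, 1.0, 2.0, 4.0]
print(squared_correlation(true_z, imputed_z))
```

A value close to 1 indicates that the imputed Z-scores track the true Z-scores almost linearly, in either direction, so the sign of the relationship should be checked separately.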

4 CONCLUSIONS

DIST is a novel tool for direct imputation of summary statistics at untyped SNPs. When compared with genotype imputation methods, DIST (i) does not need access to subject-level genotypes, (ii) provides comparable imputation accuracy while substantially shortening the running time and (iii) can be readily applied to family association statistics. Consequently, DIST is useful for investigators who need fast and fairly accurate access to imputation-based P-values but (i) do not have access to subject-level genotypes, (ii) do not want to go through the laborious process of imputing subject-level genotypes or (iii) have association statistics coming from (large) pedigree data. Unlike genotype imputation methods, DIST can avoid large increases in running time and memory as the available reference panels grow, by storing the local correlation structures in pre-computed tables. However, when compared with genotype imputation methods, DIST uses a smaller imputation window and requires the study and reference populations to be well matched. Thus, when access to subject-level genotypic data is available, genotype imputation methods are likely to outperform DIST (i) for regions with long-range linkage disequilibrium, e.g. the major histocompatibility complex locus, and (ii) when the study and the reference populations are not well matched. Consequently, whenever possible and appropriate, we recommend following up DIST signals with a genotype imputation method.
Funding: Virginia Commonwealth University start-up fund (to S.A.B.).
Conflict of Interest: none declared.

REFERENCES (10 of 18 shown)

1.  A linear complexity phasing method for thousands of genomes.

Authors:  Olivier Delaneau; Jonathan Marchini; Jean-François Zagury
Journal:  Nat Methods       Date:  2011-12-04       Impact factor: 28.547

2.  Genotype imputation for genome-wide association studies.

Authors:  Jonathan Marchini; Bryan Howie
Journal:  Nat Rev Genet       Date:  2010-07       Impact factor: 53.242

3.  Simple and efficient analysis of disease association with missing genotype data.

Authors:  D Y Lin; Y Hu; B E Huang
Journal:  Am J Hum Genet       Date:  2008-02       Impact factor: 11.025

4.  Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors:  Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal:  Proc Natl Acad Sci U S A       Date:  2009-05-27       Impact factor: 11.205

5.  Using linear predictors to impute allele frequencies from summary or pooled genotype data.

Authors:  Xiaoquan Wen; Matthew Stephens
Journal:  Ann Appl Stat       Date:  2010-09       Impact factor: 2.083

6.  Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data.

Authors:  Frank Dudbridge
Journal:  Hum Hered       Date:  2008-03-31       Impact factor: 0.444

7.  Genotype imputation.

Authors:  Yun Li; Cristen Willer; Serena Sanna; Gonçalo Abecasis
Journal:  Annu Rev Genomics Hum Genet       Date:  2009       Impact factor: 8.929

8.  Association studies for untyped markers with TUNA.

Authors:  Xiaoquan Wen; Dan L Nicolae
Journal:  Bioinformatics       Date:  2007-12-05       Impact factor: 6.937

9.  Hundreds of variants clustered in genomic loci and biological pathways affect human height.

Authors:  Hana Lango Allen; Karol Estrada; Guillaume Lettre; Sonja I Berndt; Michael N Weedon; Fernando Rivadeneira; Cristen J Willer; Anne U Jackson; Sailaja Vedantam; Soumya Raychaudhuri; Teresa Ferreira; Andrew R Wood; Robert J Weyant; Ayellet V Segrè; Elizabeth K Speliotes; Eleanor Wheeler; Nicole Soranzo; Ju-Hyun Park; Jian Yang; Daniel Gudbjartsson; Nancy L Heard-Costa; Joshua C Randall; Lu Qi; Albert Vernon Smith; Reedik Mägi; Tomi Pastinen; Liming Liang; Iris M Heid; Jian'an Luan; Gudmar Thorleifsson; Thomas W Winkler; Michael E Goddard; Ken Sin Lo; Cameron Palmer; Tsegaselassie Workalemahu; Yurii S Aulchenko; Asa Johansson; M Carola Zillikens; Mary F Feitosa; Tõnu Esko; Toby Johnson; Shamika Ketkar; Peter Kraft; Massimo Mangino; Inga Prokopenko; Devin Absher; Eva Albrecht; Florian Ernst; Nicole L Glazer; Caroline Hayward; Jouke-Jan Hottenga; Kevin B Jacobs; Joshua W Knowles; Zoltán Kutalik; Keri L Monda; Ozren Polasek; Michael Preuss; Nigel W Rayner; Neil R Robertson; Valgerdur Steinthorsdottir; Jonathan P Tyrer; Benjamin F Voight; Fredrik Wiklund; Jianfeng Xu; Jing Hua Zhao; Dale R Nyholt; Niina Pellikka; Markus Perola; John R B Perry; Ida Surakka; Mari-Liis Tammesoo; Elizabeth L Altmaier; Najaf Amin; Thor Aspelund; Tushar Bhangale; Gabrielle Boucher; Daniel I Chasman; Constance Chen; Lachlan Coin; Matthew N Cooper; Anna L Dixon; Quince Gibson; Elin Grundberg; Ke Hao; M Juhani Junttila; Lee M Kaplan; Johannes Kettunen; Inke R König; Tony Kwan; Robert W Lawrence; Douglas F Levinson; Mattias Lorentzon; Barbara McKnight; Andrew P Morris; Martina Müller; Julius Suh Ngwa; Shaun Purcell; Suzanne Rafelt; Rany M Salem; Erika Salvi; Serena Sanna; Jianxin Shi; Ulla Sovio; John R Thompson; Michael C Turchin; Liesbeth Vandenput; Dominique J Verlaan; Veronique Vitart; Charles C White; Andreas Ziegler; Peter Almgren; Anthony J Balmforth; Harry Campbell; Lorena Citterio; Alessandro De Grandi; Anna Dominiczak; Jubao Duan; Paul Elliott; Roberto Elosua; Johan G Eriksson; Nelson B Freimer; Eco J C Geus; Nicola Glorioso; Shen Haiqing; Anna-Liisa Hartikainen; Aki S Havulinna; Andrew A Hicks; Jennie Hui; Wilmar Igl; Thomas Illig; Antti Jula; Eero Kajantie; Tuomas O Kilpeläinen; Markku Koiranen; Ivana Kolcic; Seppo Koskinen; Peter Kovacs; Jaana Laitinen; Jianjun Liu; Marja-Liisa Lokki; Ana Marusic; Andrea Maschio; Thomas Meitinger; Antonella Mulas; Guillaume Paré; Alex N Parker; John F Peden; Astrid Petersmann; Irene Pichler; Kirsi H Pietiläinen; Anneli Pouta; Martin Ridderstråle; Jerome I Rotter; Jennifer G Sambrook; Alan R Sanders; Carsten Oliver Schmidt; Juha Sinisalo; Jan H Smit; Heather M Stringham; G Bragi Walters; Elisabeth Widen; Sarah H Wild; Gonneke Willemsen; Laura Zagato; Lina Zgaga; Paavo Zitting; Helene Alavere; Martin Farrall; Wendy L McArdle; Mari Nelis; Marjolein J Peters; Samuli Ripatti; Joyce B J van Meurs; Katja K Aben; Kristin G Ardlie; Jacques S Beckmann; John P Beilby; Richard N Bergman; Sven Bergmann; Francis S Collins; Daniele Cusi; Martin den Heijer; Gudny Eiriksdottir; Pablo V Gejman; Alistair S Hall; Anders Hamsten; Heikki V Huikuri; Carlos Iribarren; Mika Kähönen; Jaakko Kaprio; Sekar Kathiresan; Lambertus Kiemeney; Thomas Kocher; Lenore J Launer; Terho Lehtimäki; Olle Melander; Tom H Mosley; Arthur W Musk; Markku S Nieminen; Christopher J O'Donnell; Claes Ohlsson; Ben Oostra; Lyle J Palmer; Olli Raitakari; Paul M Ridker; John D Rioux; Aila Rissanen; Carlo Rivolta; Heribert Schunkert; Alan R Shuldiner; David S Siscovick; Michael Stumvoll; Anke Tönjes; Jaakko Tuomilehto; Gert-Jan van Ommen; Jorma Viikari; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael A Province; Manfred Kayser; Alice M Arnold; Larry D Atwood; Eric Boerwinkle; Stephen J Chanock; Panos Deloukas; Christian Gieger; Henrik Grönberg; Per Hall; Andrew T Hattersley; Christian Hengstenberg; Wolfgang Hoffman; G Mark Lathrop; Veikko Salomaa; Stefan Schreiber; Manuela Uda; Dawn Waterworth; Alan F Wright; Themistocles L Assimes; Inês Barroso; Albert Hofman; Karen L Mohlke; Dorret I Boomsma; Mark J Caulfield; L Adrienne Cupples; Jeanette Erdmann; Caroline S Fox; Vilmundur Gudnason; Ulf Gyllensten; Tamara B Harris; Richard B Hayes; Marjo-Riitta Jarvelin; Vincent Mooser; Patricia B Munroe; Willem H Ouwehand; Brenda W Penninx; Peter P Pramstaller; Thomas Quertermous; Igor Rudan; Nilesh J Samani; Timothy D Spector; Henry Völzke; Hugh Watkins; James F Wilson; Leif C Groop; Talin Haritunians; Frank B Hu; Robert C Kaplan; Andres Metspalu; Kari E North; David Schlessinger; Nicholas J Wareham; David J Hunter; Jeffrey R O'Connell; David P Strachan; H-Erich Wichmann; Ingrid B Borecki; Cornelia M van Duijn; Eric E Schadt; Unnur Thorsteinsdottir; Leena Peltonen; André G Uitterlinden; Peter M Visscher; Nilanjan Chatterjee; Ruth J F Loos; Michael Boehnke; Mark I McCarthy; Erik Ingelsson; Cecilia M Lindgren; Gonçalo R Abecasis; Kari Stefansson; Timothy M Frayling; Joel N Hirschhorn
Journal:  Nature       Date:  2010-09-29       Impact factor: 49.962

10.  Imputation-based analysis of association studies: candidate regions and quantitative traits.

Authors:  Bertrand Servin; Matthew Stephens
Journal:  PLoS Genet       Date:  2007-05-30       Impact factor: 5.917
