MOTIVATION: Genotype imputation methods are used to enhance the resolution of genome-wide association studies, and thus increase the detection rate for genetic signals. Although most studies report all univariate summary statistics, many of them limit the access to subject-level genotypes. Because such an access is required by all genotype imputation methods, it is helpful to develop methods that impute summary statistics without going through the interim step of imputing genotypes. Even when subject-level genotypes are available, due to the substantial computational cost of the typical genotype imputation, there is a need for faster imputation methods. RESULTS: Direct Imputation of summary STatistics (DIST) imputes the summary statistics of untyped variants without first imputing their subject-level genotypes. This is achieved by (i) using the conditional expectation formula for multivariate normal variates and (ii) using the correlation structure from a relevant reference population. When compared with genotype imputation methods, DIST (i) requires only a fraction of their computational resources, (ii) has comparable imputation accuracy for independent subjects and (iii) is readily applicable to the imputation of association statistics coming from large pedigree data. Thus, the proposed application is useful for a fast imputation of summary results for (i) studies of unrelated subjects, which (a) do not provide subject-level genotypes or (b) have a large size and (ii) family association studies. AVAILABILITY AND IMPLEMENTATION: Pre-compiled executables built under commonly used operating systems are publicly available at http://code.google.com/p/dist/. CONTACT: dlee4@vcu.edu .
1 INTRODUCTION

Genome-wide association studies (GWASs) have been successful in detecting associations between genetic variants and complex diseases (Hindorff et al.). However, GWASs genotype only a fraction of the tens of millions of single nucleotide polymorphisms (SNPs) found in the human genome. To increase resolution, and thus the detection rate for genetic signals, researchers proposed imputing genotypes at numerous untyped (unmeasured) SNPs (de Bakker et al.; Li et al.; Marchini and Howie, 2010).

Most commonly used genotype imputation tools, e.g. IMPUTE2 (Howie et al.), MACH (Li et al.), BIMBAM (Servin and Stephens, 2007) and BEAGLE (Browning and Browning, 2007), are based on Hidden Markov Models (HMMs). Although these methods are accurate, they are computationally very demanding because they need to phase the haplotypes of all subjects in the study. This burden will grow further with the ever-increasing size of studies and reference panels. Other genotype imputation methods [PLINK (Purcell et al.), SNPMSTAT (Lin et al.), UNPHASED (Dudbridge, 2008) and TUNA (Wen and Nicolae, 2008)] are based on multinomial models (MMs) of haplotype frequencies instead of HMMs. These methods are simpler and faster, but their imputation accuracy is generally lower than that of HMM-based methods (Marchini and Howie, 2008, 2010). Recently, researchers proposed a new MM-based imputation method, called BLIMP, which imputes genotypes/allele frequencies for unmeasured SNPs by using the conditional expectation formula for multivariate normal variates (Wen and Stephens, 2010). Regardless of the underlying model, all these imputation tools require a two-stage procedure that (i) imputes subject-level genotypes at the unmeasured SNPs on the basis of genotypes at measured SNPs and a relevant reference population [e.g. 1000 Genomes (1KG) (Altshuler et al.)] and (ii) tests for association between the imputed genotypes and the phenotype of interest.
However, this procedure requires access to subject-level genotypes, which are often unavailable.

To directly impute summary statistics while (i) substantially reducing the computational burden and (ii) retaining imputation accuracy, we propose Direct Imputation of summary STatistics (DIST). DIST avoids the imputation of subject-level genotypes by directly applying, to unmeasured SNP statistics, the classical conditional expectation formula for multivariate normal variates.
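For reference, the classical conditional expectation formula mentioned above can be written as follows. The notation is ours and is only an illustrative assumption: Z_m and Z_u denote the vectors of statistics at measured and unmeasured SNPs, and Σ denotes their joint correlation matrix, partitioned accordingly. The means are taken to be zero, as they would be under the null hypothesis.

```latex
\[
\begin{pmatrix} Z_u \\ Z_m \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \Sigma_{uu} & \Sigma_{um} \\ \Sigma_{mu} & \Sigma_{mm} \end{pmatrix}
\right)
\quad\Longrightarrow\quad
\begin{aligned}
E[Z_u \mid Z_m] &= \Sigma_{um}\,\Sigma_{mm}^{-1}\,Z_m, \\
\operatorname{Var}[Z_u \mid Z_m] &= \Sigma_{uu} - \Sigma_{um}\,\Sigma_{mm}^{-1}\,\Sigma_{mu}.
\end{aligned}
\]
```

The conditional mean is linear in the measured statistics, so only the measured statistics and the correlation matrix are needed to evaluate it.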
2 SOFTWARE
DIST imputes the statistics at the unmeasured SNPs in a prediction window as a function of (i) the statistics at measured SNPs in a larger window, henceforth called the extended window, and (ii) the correlation matrix of both measured and unmeasured statistics, as estimated from a relevant reference dataset. Similar to BLIMP (Wen and Stephens, 2010), DIST uses the conditional mean formula for multivariate normal variates. Unlike BLIMP, DIST applies the formula (i) directly to summary statistics and (ii) under the null hypothesis, i.e. the hypothesis under which the distribution of all statistical tests is computed. This novel application allows the imputation of summary statistics without imputing subject-level genotypes (see Section 1.1 in Supplementary Material).

To reduce computation time, DIST is implemented in C++, and it can be easily used under Linux, Windows, MacOS, etc. DIST takes as input a file containing the (normally distributed) GWAS/meta-analysis summary statistics (see Section 1.5 of Supplementary Material for more information on obtaining the normally distributed statistics). It provides a command line interface with options for specifying, among other settings, the number of measured SNPs to be contained in (i) the prediction window and (ii) each side region of the extended window (Supplementary Table S1).
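The conditional mean computation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DIST implementation (which is written in C++): the function name `impute_z`, its arguments, the optional `ridge` stabilizer and the returned `info` diagnostic are all hypothetical, and the windowing logic is omitted.

```python
import numpy as np

def impute_z(z_measured, corr, measured_idx, unmeasured_idx, ridge=0.0):
    """Conditional-mean imputation of Z-scores under a zero-mean normal model.

    z_measured     : Z-scores at the measured SNPs (same order as measured_idx)
    corr           : joint correlation matrix of all SNPs in the extended window,
                     estimated from a reference panel
    measured_idx   : positions of the measured SNPs in `corr`
    unmeasured_idx : positions of the unmeasured SNPs in `corr`
    ridge          : hypothetical stabilizer added to the diagonal of the
                     measured block (an assumption, not a documented DIST option)
    """
    corr = np.asarray(corr, dtype=float)
    z_m = np.asarray(z_measured, dtype=float)

    # Partition the correlation matrix into measured/unmeasured blocks.
    s_mm = corr[np.ix_(measured_idx, measured_idx)]
    s_um = corr[np.ix_(unmeasured_idx, measured_idx)]
    s_mm = s_mm + ridge * np.eye(len(measured_idx))

    # weights = S_um S_mm^{-1}; solve() avoids forming the explicit inverse.
    weights = np.linalg.solve(s_mm, s_um.T).T

    # Conditional mean of the unmeasured statistics given the measured ones.
    z_imputed = weights @ z_m

    # Diagonal of S_um S_mm^{-1} S_mu: the share of each unmeasured statistic's
    # variance that the measured SNPs account for (illustrative diagnostic).
    info = np.einsum("ij,ij->i", weights, s_um)
    return z_imputed, info

# Toy example: three SNPs, the middle one unmeasured.
toy_corr = [[1.00, 0.50, 0.25],
            [0.50, 1.00, 0.50],
            [0.25, 0.50, 1.00]]
toy_z, toy_info = impute_z([2.0, 3.0], toy_corr, [0, 2], [1])
```

In the toy example each measured SNP receives a weight of 0.4, so the imputed statistic is a shrunken combination of its two neighbours.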
3 RESULTS
We compare the performance of DIST with the performance of typical HMM and MM methods. Given the speed and accuracy of SHAPEIT phasing (Delaneau et al.), we chose IMPUTE2 as the representative for the HMM-based methods. Given its wide availability, we chose PLINK as the representative for the MM-based methods. Thus, we compare the performance of DIST, IMPUTE2 and PLINK (at default settings) in predicting 99 imputed height SNPs in 25 realistic simulations under both the null and the alternative hypothesis (Fig. 1). The phenotypes (height) of 5000 subjects are simulated as a function of the effects at the 180 significant SNPs from the height meta-analysis (Lango et al.) (Section 2 in Supplementary Material). Imputations used Europeans in 1KG as the reference sample and were performed on a single Linux machine (Intel Xeon 2.67 GHz processor and 64 GB of RAM).
Fig. 1.
Imputed Z-scores as a function of the true Z-scores by imputation method (strip), under the null (red) and the alternative (blue) hypothesis
The average accuracy of imputed Z-scores in the above 50 simulations, as measured by the squared correlation coefficient (r²) between imputed and true Z-scores, is high for DIST (0.98) and IMPUTE2 (0.99) and moderately high for PLINK (0.92) (Fig. 1). The average running time per simulation was 76 min for DIST, 270/965 min for imputing/pre-phasing for IMPUTE2 and 3971 min for PLINK. The maximum memory requirement was 52 MB for DIST, ∼9500 MB for IMPUTE2 and 5300 MB for PLINK. [DIST and IMPUTE2 were also used to impute the statistics for 5% SNPs missing at random on chromosome 22 of a dataset of 5000 subjects (which included the data described in the next paragraph); DIST required 33 min of running time and at most 283 MB of memory, and IMPUTE2 required 846/3437 min for imputation/pre-phasing and 9470 MB of memory.]

To impute untyped statistics, DIST requires only the joint correlation matrix for the statistics at typed and untyped SNPs. Because this matrix does not depend on the relationship between subjects in the study, unlike genotype imputation methods, DIST can be readily used to impute statistics for family association studies. We illustrate this advantage by applying the method to a proprietary Irish alcohol dependence study sample consisting of 1755 controls and 710 cases from 431 Irish families. The subjects were genotyped using Affymetrix 6.0 SNP array, and the association statistics were computed using MQLS (Thornton and McPeek, 2007). To impute unmeasured SNPs, we used UK10K (www.uk10k.org) as the reference panel (Supplementary Fig. S2).
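As an illustration of how a joint correlation matrix of this kind might be obtained from a reference panel, the sketch below computes pairwise SNP correlations from a matrix of reference genotypes. It assumes, as is common for summary-statistic methods, that the null correlation between the statistics at two SNPs is approximated by the correlation between their genotypes in the reference sample. The function name `reference_correlation` and the 0/1/2 dosage coding are hypothetical choices, not part of DIST.

```python
import numpy as np

def reference_correlation(genotypes):
    """Estimate the SNP-by-SNP correlation matrix from reference genotypes.

    genotypes : array of shape (n_subjects, n_snps) with allele dosages
                (e.g. 0, 1 or 2 copies of the reference allele).
    Returns an (n_snps, n_snps) Pearson correlation matrix.
    """
    g = np.asarray(genotypes, dtype=float)
    if g.ndim != 2:
        raise ValueError("genotypes must be a 2-D subjects-by-SNPs array")
    # A monomorphic SNP has zero variance, so its correlation is undefined.
    if np.any(g.std(axis=0) == 0):
        raise ValueError("remove monomorphic SNPs before computing correlations")
    # rowvar=False treats each column (SNP) as a variable.
    return np.corrcoef(g, rowvar=False)

# Toy reference panel: four subjects, three SNPs.
toy_panel = [[0, 0, 2],
             [1, 1, 1],
             [2, 2, 0],
             [1, 0, 1]]
toy_ld = reference_correlation(toy_panel)
```

Because the matrix depends only on the reference panel, it can be computed once per region and reused across studies that share the same reference population.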
4 CONCLUSIONS
DIST is a novel tool for direct imputation of summary statistics at untyped SNPs. When compared with genotype imputation methods, DIST (i) does not need access to subject-level genotypes, (ii) provides comparable imputation accuracy while substantially shortening the running time and (iii) can be readily applied to family association statistics. Consequently, DIST is useful for investigators who need fast and fairly accurate access to imputation-based P-values but (i) do not have access to subject-level genotypes, (ii) do not want to go through the laborious process of imputing subject-level genotypes or (iii) have association statistics coming from (large) pedigree data. Unlike genotype imputation methods, DIST can avoid large increases in running time/memory as the available reference panels grow in size, by storing the local correlation structures in pre-computed tables.

When compared with genotype imputation methods, DIST uses a smaller imputation window and requires the study and reference populations to be well matched. Thus, when access to subject-level genotypic data is available, genotype imputation methods are likely to outperform DIST (i) for regions with long-range linkage disequilibrium, e.g. the major histocompatibility complex locus, and (ii) when the study and the reference populations are not well matched. Consequently, whenever possible/appropriate, we recommend following up DIST signals using a genotype imputation method.

Funding: Virginia Commonwealth University start-up fund (to S.A.B.)

Conflict of Interest: none declared.
Authors: Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio Journal: Proc Natl Acad Sci U S A Date: 2009-05-27 Impact factor: 11.205
Authors: Hana Lango Allen; Karol Estrada; Guillaume Lettre; Sonja I Berndt; Michael N Weedon; Fernando Rivadeneira; Cristen J Willer; Anne U Jackson; Sailaja Vedantam; Soumya Raychaudhuri; Teresa Ferreira; Andrew R Wood; Robert J Weyant; Ayellet V Segrè; Elizabeth K Speliotes; Eleanor Wheeler; Nicole Soranzo; Ju-Hyun Park; Jian Yang; Daniel Gudbjartsson; Nancy L Heard-Costa; Joshua C Randall; Lu Qi; Albert Vernon Smith; Reedik Mägi; Tomi Pastinen; Liming Liang; Iris M Heid; Jian'an Luan; Gudmar Thorleifsson; Thomas W Winkler; Michael E Goddard; Ken Sin Lo; Cameron Palmer; Tsegaselassie Workalemahu; Yurii S Aulchenko; Asa Johansson; M Carola Zillikens; Mary F Feitosa; Tõnu Esko; Toby Johnson; Shamika Ketkar; Peter Kraft; Massimo Mangino; Inga Prokopenko; Devin Absher; Eva Albrecht; Florian Ernst; Nicole L Glazer; Caroline Hayward; Jouke-Jan Hottenga; Kevin B Jacobs; Joshua W Knowles; Zoltán Kutalik; Keri L Monda; Ozren Polasek; Michael Preuss; Nigel W Rayner; Neil R Robertson; Valgerdur Steinthorsdottir; Jonathan P Tyrer; Benjamin F Voight; Fredrik Wiklund; Jianfeng Xu; Jing Hua Zhao; Dale R Nyholt; Niina Pellikka; Markus Perola; John R B Perry; Ida Surakka; Mari-Liis Tammesoo; Elizabeth L Altmaier; Najaf Amin; Thor Aspelund; Tushar Bhangale; Gabrielle Boucher; Daniel I Chasman; Constance Chen; Lachlan Coin; Matthew N Cooper; Anna L Dixon; Quince Gibson; Elin Grundberg; Ke Hao; M Juhani Junttila; Lee M Kaplan; Johannes Kettunen; Inke R König; Tony Kwan; Robert W Lawrence; Douglas F Levinson; Mattias Lorentzon; Barbara McKnight; Andrew P Morris; Martina Müller; Julius Suh Ngwa; Shaun Purcell; Suzanne Rafelt; Rany M Salem; Erika Salvi; Serena Sanna; Jianxin Shi; Ulla Sovio; John R Thompson; Michael C Turchin; Liesbeth Vandenput; Dominique J Verlaan; Veronique Vitart; Charles C White; Andreas Ziegler; Peter Almgren; Anthony J Balmforth; Harry Campbell; Lorena Citterio; Alessandro De Grandi; Anna Dominiczak; Jubao Duan; Paul Elliott; Roberto Elosua; Johan G Eriksson; 
Nelson B Freimer; Eco J C Geus; Nicola Glorioso; Shen Haiqing; Anna-Liisa Hartikainen; Aki S Havulinna; Andrew A Hicks; Jennie Hui; Wilmar Igl; Thomas Illig; Antti Jula; Eero Kajantie; Tuomas O Kilpeläinen; Markku Koiranen; Ivana Kolcic; Seppo Koskinen; Peter Kovacs; Jaana Laitinen; Jianjun Liu; Marja-Liisa Lokki; Ana Marusic; Andrea Maschio; Thomas Meitinger; Antonella Mulas; Guillaume Paré; Alex N Parker; John F Peden; Astrid Petersmann; Irene Pichler; Kirsi H Pietiläinen; Anneli Pouta; Martin Ridderstråle; Jerome I Rotter; Jennifer G Sambrook; Alan R Sanders; Carsten Oliver Schmidt; Juha Sinisalo; Jan H Smit; Heather M Stringham; G Bragi Walters; Elisabeth Widen; Sarah H Wild; Gonneke Willemsen; Laura Zagato; Lina Zgaga; Paavo Zitting; Helene Alavere; Martin Farrall; Wendy L McArdle; Mari Nelis; Marjolein J Peters; Samuli Ripatti; Joyce B J van Meurs; Katja K Aben; Kristin G Ardlie; Jacques S Beckmann; John P Beilby; Richard N Bergman; Sven Bergmann; Francis S Collins; Daniele Cusi; Martin den Heijer; Gudny Eiriksdottir; Pablo V Gejman; Alistair S Hall; Anders Hamsten; Heikki V Huikuri; Carlos Iribarren; Mika Kähönen; Jaakko Kaprio; Sekar Kathiresan; Lambertus Kiemeney; Thomas Kocher; Lenore J Launer; Terho Lehtimäki; Olle Melander; Tom H Mosley; Arthur W Musk; Markku S Nieminen; Christopher J O'Donnell; Claes Ohlsson; Ben Oostra; Lyle J Palmer; Olli Raitakari; Paul M Ridker; John D Rioux; Aila Rissanen; Carlo Rivolta; Heribert Schunkert; Alan R Shuldiner; David S Siscovick; Michael Stumvoll; Anke Tönjes; Jaakko Tuomilehto; Gert-Jan van Ommen; Jorma Viikari; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael A Province; Manfred Kayser; Alice M Arnold; Larry D Atwood; Eric Boerwinkle; Stephen J Chanock; Panos Deloukas; Christian Gieger; Henrik Grönberg; Per Hall; Andrew T Hattersley; Christian Hengstenberg; Wolfgang Hoffman; G Mark Lathrop; Veikko Salomaa; Stefan Schreiber; Manuela Uda; Dawn Waterworth; Alan F Wright; Themistocles L Assimes; Inês 
Barroso; Albert Hofman; Karen L Mohlke; Dorret I Boomsma; Mark J Caulfield; L Adrienne Cupples; Jeanette Erdmann; Caroline S Fox; Vilmundur Gudnason; Ulf Gyllensten; Tamara B Harris; Richard B Hayes; Marjo-Riitta Jarvelin; Vincent Mooser; Patricia B Munroe; Willem H Ouwehand; Brenda W Penninx; Peter P Pramstaller; Thomas Quertermous; Igor Rudan; Nilesh J Samani; Timothy D Spector; Henry Völzke; Hugh Watkins; James F Wilson; Leif C Groop; Talin Haritunians; Frank B Hu; Robert C Kaplan; Andres Metspalu; Kari E North; David Schlessinger; Nicholas J Wareham; David J Hunter; Jeffrey R O'Connell; David P Strachan; H-Erich Wichmann; Ingrid B Borecki; Cornelia M van Duijn; Eric E Schadt; Unnur Thorsteinsdottir; Leena Peltonen; André G Uitterlinden; Peter M Visscher; Nilanjan Chatterjee; Ruth J F Loos; Michael Boehnke; Mark I McCarthy; Erik Ingelsson; Cecilia M Lindgren; Gonçalo R Abecasis; Kari Stefansson; Timothy M Frayling; Joel N Hirschhorn Journal: Nature Date: 2010-09-29 Impact factor: 49.962
Authors: Alexis C Edwards; Tim B Bigdeli; Anna R Docherty; Silviu Bacanu; Donghyung Lee; Teresa R de Candia; Arden Moscati; Dawn L Thiselton; Brion S Maher; Brandon K Wormley; Dermot Walsh; Francis A O'Neill; Kenneth S Kendler; Brien P Riley; Ayman H Fanous Journal: Schizophr Bull Date: 2015-08-27 Impact factor: 9.306
Authors: Alexander Gusev; S Hong Lee; Gosia Trynka; Hilary Finucane; Bjarni J Vilhjálmsson; Han Xu; Chongzhi Zang; Stephan Ripke; Brendan Bulik-Sullivan; Eli Stahl; Anna K Kähler; Christina M Hultman; Shaun M Purcell; Steven A McCarroll; Mark Daly; Bogdan Pasaniuc; Patrick F Sullivan; Benjamin M Neale; Naomi R Wray; Soumya Raychaudhuri; Alkes L Price Journal: Am J Hum Genet Date: 2014-11-06 Impact factor: 11.025
Authors: Diptavo Dutta; Peter VandeHaar; Lars G Fritsche; Sebastian Zöllner; Michael Boehnke; Laura J Scott; Seunggeun Lee Journal: Am J Hum Genet Date: 2021-03-16 Impact factor: 11.025