Louis-Philippe Lemieux Perreault¹, Marc-André Legault¹,², Géraldine Asselin¹, Marie-Pierre Dubé¹,³
¹ Beaulieu-Saucier Université de Montréal Pharmacogenomics Centre, Montreal Heart Institute Research Center, Montréal, Canada H1T 1C8
² Department of Biochemistry and Molecular Medicine, Université de Montréal, Montreal, Canada H3T 1J4
³ Department of Medicine, Université de Montréal, Montreal, Canada H3T 1J4
Abstract
Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, imputation using the most recent publicly available reference datasets can require considerable computing power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies).
Availability and implementation: The genipe package is open-source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe.
Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Genome-wide association studies (GWAS) are usually performed on datasets containing over 1 million genetic markers. Those markers are typed using high-throughput genotyping arrays that target a small fraction of all possible genetic variants. Imputation is a popular, low-cost statistical method to infer genotypic information at up to 80 million known genetic variants (including single nucleotide variants and insertions/deletions). It is often used to boost statistical power, and it can also be used to infer missing data and to standardize variant sets for meta-analysis (Marchini ).
Large sequencing projects spanning multiple human populations have greatly increased the availability and quality of public imputation panels, but the statistical methods needed for haplotype phasing and imputation at the genome-wide level are computationally intensive. To streamline this process, we have developed a genome-wide imputation pipeline that automates all necessary computational steps, including quality control and reporting, while providing support for high-performance computing environments and statistical tests for imputed data.
2 Methods
2.1 Main pipeline
The main pipeline relies on three widely used bioinformatics tools: PLINK (Purcell ) for the initial genetic data management, SHAPEIT (Delaneau ) for strand verification and phasing, and IMPUTE2 (Howie ) for imputation. The genipe package orchestrates the pipeline according to the best practices for genome-wide imputation analysis (as described by SHAPEIT and IMPUTE2) and manages the intermediate files created by the tools at each stage. The following steps are performed in a typical imputation analysis with genipe (most of them in parallel; see Supplementary Table S1).
Quality metrics and statistics on the initial dataset are obtained, including the missing call rate, which is computed using PLINK for all loci and all samples of the study dataset. Genetic loci from the study panel are then filtered according to their genomic location and allele composition. Starting from a binary ped file (PLINK format), loci located on the Y and mitochondrial (MT) chromosomes, or with ambiguous alleles (A/T or G/C), are removed from the dataset. At the same time, when a variant is duplicated, a single copy is kept for downstream analysis. An optional strand check can be performed to ensure consistency between the human genome reference and the microarray-derived alleles: alleles that are inconsistent with the reference are complemented according to Watson-Crick base pairing. The strands of the study data and of the imputation reference panels are then compared using SHAPEIT, and inconsistent loci are flipped. A second and final strand verification is performed, and the remaining loci that are discordant with the imputation reference are excluded from the dataset.
The remaining markers are phased in parallel for each chromosome using SHAPEIT. The phasing tool can run multiple threads concurrently for each chromosome, which speeds up the analysis.
Finally, IMPUTE2 is used to perform genome-wide imputation. Each chromosome is split into segments of 5 Mb (by default) for the imputation process, in accordance with the segment size limitation of the IMPUTE2 program. For each segment, IMPUTE2 produces cross-validation metrics, where a subset of genotyped loci are masked, imputed and compared to their expected value. The genipe package aggregates those statistics at the chromosomal and genomic levels using a weighted arithmetic mean.
Results for all segments are merged to produce one result file per chromosome. Companion files reporting information such as the minor allele frequency and the information value (INFO) are also produced. The merged IMPUTE2 result file can optionally be compressed using BGZIP, a variant of GZIP that allows indexing.
The pipeline can be run on a desktop computer or on a DRMAA-compliant high-performance computing server. The phasing and imputation steps are customizable to fit any study design. A database keeps track of all executed steps, so that after a failure the pipeline can be relaunched from where it last stopped, saving processing time and resources. At completion, a LaTeX report containing important imputation quality metrics and processing times is automatically created.
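The bookkeeping behind three of the steps above (locus filtering, chromosome segmentation and aggregation of cross-validation statistics) can be illustrated with a minimal Python sketch. This is not the genipe implementation: the function names, the tuple layout of a locus, the definition of a duplicate (same chromosome, position and allele pair) and the choice of weights are all assumptions made for illustration. The chromosome codes 24 and 26 are the numeric codes PLINK uses for Y and MT.

```python
# Illustrative sketch only; names and data layout are hypothetical, not genipe's API.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}
EXCLUDED_CHROMOSOMES = {"24", "26", "Y", "MT"}  # Y and mitochondrial


def is_ambiguous(a1, a2):
    """True for A/T or G/C loci, whose strand cannot be inferred from alleles."""
    return frozenset((a1.upper(), a2.upper())) in AMBIGUOUS_PAIRS


def flip_strand(a1, a2):
    """Complement both alleles (single-nucleotide alleles only)."""
    return COMPLEMENT[a1.upper()], COMPLEMENT[a2.upper()]


def filter_loci(loci):
    """Drop Y/MT loci, ambiguous loci and duplicates (first copy is kept).

    `loci` is an iterable of (chrom, name, pos, a1, a2) tuples; a duplicate is
    assumed to be a locus with the same chromosome, position and allele pair.
    """
    seen = set()
    kept = []
    for chrom, name, pos, a1, a2 in loci:
        if chrom in EXCLUDED_CHROMOSOMES or is_ambiguous(a1, a2):
            continue
        key = (chrom, pos, frozenset((a1.upper(), a2.upper())))
        if key in seen:
            continue
        seen.add(key)
        kept.append((chrom, name, pos, a1, a2))
    return kept


def split_into_segments(chrom_length, segment_size=5_000_000):
    """Return 1-based inclusive (start, end) windows covering the chromosome."""
    segments = []
    start = 1
    while start <= chrom_length:
        end = min(start + segment_size - 1, chrom_length)
        segments.append((start, end))
        start = end + 1
    return segments


def weighted_mean(values, weights):
    """Weighted arithmetic mean, e.g. per-segment concordance weighted by the
    number of masked genotypes in each segment (the weights are an assumption)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total
```

For example, `split_into_segments(12_000_000)` yields two full 5 Mb windows and one shorter final window. A real pipeline would also have to handle regions without markers, which this sketch ignores.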
2.2 Output file management
Given the high number of variants imputed using a genome-wide approach, file management becomes a computational challenge that needs to be addressed. To facilitate this task, we have developed the impute2-extractor utility that uses file indexing to accelerate extraction of variants from IMPUTE2 files by name or by genomic region. Allele frequency filtering as well as quality control thresholds based on the completion rate, the INFO field and the imputation probability can also be specified. This script can be used as a stand-alone utility and is automatically installed and added to the system path after setting up genipe.
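The kind of per-variant filtering described above can be sketched in a few lines of Python. The sketch below does not reproduce the impute2-extractor code or its command-line options. It assumes the usual IMPUTE2 genotype layout (five leading columns followed by three genotype probabilities per sample), and the function names and default thresholds are hypothetical. The INFO value is read from a companion file in practice and is left out here.

```python
# Illustrative sketch only; not the impute2-extractor implementation.
def parse_impute2_line(line):
    """Split one IMPUTE2 line into its variant fields and probability triplets."""
    fields = line.split()
    chrom, name, pos, a1, a2 = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if not probs or len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return chrom, name, int(pos), a1, a2, triplets


def summarize(triplets, min_prob=0.9):
    """Return (completion rate, minor allele frequency).

    A sample counts as 'called' when its best genotype probability reaches
    min_prob; the allele frequency is computed from the dosage of the second
    allele over called samples only. Returns MAF None if no sample is called.
    """
    called = [t for t in triplets if max(t) >= min_prob]
    completion = len(called) / len(triplets)
    if not called:
        return completion, None
    freq_a2 = sum(t[1] + 2 * t[2] for t in called) / (2 * len(called))
    return completion, min(freq_a2, 1 - freq_a2)


def passes_filters(line, min_maf=0.01, min_completion=0.98, min_prob=0.9):
    """Apply hypothetical MAF and completion-rate thresholds to one variant."""
    *_, triplets = parse_impute2_line(line)
    completion, maf = summarize(triplets, min_prob)
    return maf is not None and maf >= min_maf and completion >= min_completion
```

Scanning a whole file this way is linear in its size, which is why an index that maps variant names or positions to file offsets, as the utility uses, matters for large imputed datasets.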
2.3 Statistical analysis
Bindings for linear and logistic regression using the statsmodels package (Seabold ) are included for common variant association testing with continuous and discrete outcomes. Other more sophisticated models, which were not previously available for dosage data in an efficient tool, now have bindings with genipe. These include the Cox proportional hazards model (lifelines package) which had no suitable implementations for pharmacogenomics GWAS (Syed ), an optimization (Sikorska ) of the linear mixed model for repeated measurements (statsmodels package, see Supplementary Figs S2 and S3), and the Sequence Kernel Association Test [SKAT R package (Ionita-Laza )]. All these tests are now available for dosage data through the companion script imputed-stats. See Supplementary Figure S1 for execution times for the different models.
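To make the notion of dosage data concrete, the sketch below converts IMPUTE2 genotype probabilities into allele dosages and regresses a continuous outcome on them. It is a dependency-free stand-in using ordinary least squares with a single predictor, not the statsmodels bindings used by imputed-stats. The function names are hypothetical, and counting the second allele as the effect allele is an assumption.

```python
# Illustrative sketch only; genipe relies on statsmodels for its regression models.
def dosages(triplets):
    """Expected count of the second allele per sample: P(het) + 2 * P(hom2)."""
    return [p_het + 2 * p_hom2 for _p_hom1, p_het, p_hom2 in triplets]


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor has no variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Usage: four samples with genotype probabilities and a continuous phenotype.
probabilities = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0)]
phenotype = [1.0, 3.0, 5.0, 2.0]
intercept, slope = simple_linear_regression(dosages(probabilities), phenotype)
```

Using the dosage rather than the most likely genotype keeps the imputation uncertainty in the predictor, which is the reason the statistical tests are run on dosage data.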
3 Application
The pipeline’s documentation provides a typical imputation analysis tutorial along with the required files. Those files include a dataset of 2 278 357 markers genotyped on 90 HapMap samples. Using the 1000 Genomes Phase 3 reference panels, the dataset was imputed on two different systems: a computing server (10 nodes of 8 Intel® Xeon® E5620 CPUs at 2.40 GHz, 48 GB of RAM per node) using the DRMAA API for automatic task submission, and a desktop computer (Intel® Core™ i7-3770 CPU at 3.40 GHz, 16 GB of RAM). Using a maximum of 50 simultaneous tasks, the pipeline took a total of 4.25 h on the computing server (including a waiting period of 0.08 h in the queue). Using a maximum of four simultaneous tasks, the pipeline took 10.62 h to complete on the desktop computer.
4 Conclusion
Although online imputation pipelines exist (e.g. the Michigan Imputation Server, which uses Minimac3, https://imputationserver.sph.umich.edu/, and the Sanger Imputation Service, which uses PBWT, https://imputation.sanger.ac.uk/), genipe is advantageous for users who cannot upload genotypic data to an off-site server because of ethical or legal restrictions. Also, as public servers gain in popularity, their high workload can add significant queue time to the imputation analysis. The genipe pipeline can be efficiently executed on a local high-performance computing server or on a single desktop computer.
Finally, genipe provides a unified interface to statistical analysis packages that did not have existing tools to automate the use of dosage data (e.g. linear mixed models from statsmodels, Cox proportional hazards from lifelines, and SKAT).
Funding
This work was supported by the Montreal Heart Institute Foundation, Genome Canada and Genome Quebec.
Conflict of Interest: none declared.
Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025