
genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.

Louis-Philippe Lemieux Perreault¹, Marc-André Legault¹,², Géraldine Asselin¹, Marie-Pierre Dubé¹,³.

Abstract

Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computational power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies).
AVAILABILITY AND IMPLEMENTATION: The genipe package is open-source Python software, freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe.
CONTACT: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016; PMID: 27497439; PMCID: PMC5181529; DOI: 10.1093/bioinformatics/btw487

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Genome-wide association studies (GWAS) are usually performed on datasets containing over 1 million genetic markers. Those markers are typed using high-throughput genotyping arrays that target a small fraction of all possible genetic variants. Imputation is a low-cost and popular statistical method to infer genotypic information at up to 80 million known genetic variants (including single nucleotide variants and insertions/deletions). Imputation is often used to boost statistical power, and it can also be used to infer missing data and to standardize variant sets for meta-analysis (Marchini and Howie). Large sequencing projects spanning multiple human populations greatly increase the availability and quality of public imputation panels, but the statistical methods needed for haplotype phasing and imputation at the genome-wide level are computationally intensive. To streamline this process, we have developed a genome-wide imputation pipeline that automates all necessary computational steps, including quality control and reporting, while providing support for high-performance computing environments and statistical tests for imputed data.

2 Methods

2.1 Main pipeline

The main pipeline relies on three widely used bioinformatics tools: PLINK (Purcell et al.) for the initial genetic data management, SHAPEIT (Delaneau et al.) for strand verification and phasing, and IMPUTE2 (Howie et al.) for imputation. The genipe package orchestrates the pipeline according to the best practices for genome-wide imputation analysis (as described by SHAPEIT and IMPUTE2), and manages the intermediate files created by the tools at the different stages. The following steps are performed (most of them in parallel, see Supplementary Table S1) in a typical imputation analysis with genipe. First, quality metrics and statistics on the initial dataset are obtained, including the missing call rate, which is computed using PLINK for all loci and all samples of the study dataset. Genetic loci from the study panel are then filtered according to their genomic location and allele composition: starting from a binary ped file (PLINK format), loci located on the Y and Mt chromosomes, or with ambiguous alleles (A/T or G/C), are removed from the dataset. At the same time, duplicated variants are resolved so that a single version of each genetic marker is kept for downstream analysis. An optional strand check can be performed to ensure consistency between the human genome reference and the microarray-derived alleles: alleles that are inconsistent with the reference are replaced by their Watson-Crick complements. The strands of the study data and of the imputation reference panels are then compared using SHAPEIT, and inconsistent loci are flipped. A second and final strand verification is performed, and the remaining loci that are still discordant with the imputation reference are excluded from the dataset. The remaining markers are phased in parallel for each chromosome using SHAPEIT. The phasing tool can run multiple threads concurrently for each chromosome, increasing the speed of the analysis.
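
The marker filtering and strand flipping described above can be illustrated with a minimal Python sketch. This is not genipe's code: the pipeline delegates these steps to PLINK and SHAPEIT. The function names, the chromosome labels and the rule used here to define a duplicate (same chromosome, position and allele pair) are assumptions made for illustration only.

```python
# Illustrative sketch only; genipe itself relies on PLINK and SHAPEIT here.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
EXCLUDED_CHROMOSOMES = {"Y", "MT"}  # hypothetical chromosome labels


def is_ambiguous(a1, a2):
    """True for A/T and G/C markers, whose alleles are complements of each other."""
    return COMPLEMENT.get(a1) == a2


def flip_strand(a1, a2):
    """Complement both alleles according to Watson-Crick pairing."""
    return COMPLEMENT[a1], COMPLEMENT[a2]


def filter_markers(markers):
    """Drop Y/Mt and ambiguous markers, and keep one copy of each duplicate.

    Assumption: two markers are duplicates when they share the same
    chromosome, position and (unordered) pair of alleles.
    """
    seen = set()
    kept = []
    for name, chrom, pos, a1, a2 in markers:
        if chrom in EXCLUDED_CHROMOSOMES or is_ambiguous(a1, a2):
            continue
        key = (chrom, pos, frozenset((a1, a2)))
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, chrom, pos, a1, a2))
    return kept


markers = [
    ("rs1", "1", 1000, "A", "G"),
    ("rs1_dup", "1", 1000, "G", "A"),  # duplicate of rs1
    ("rs2", "1", 2000, "A", "T"),      # ambiguous A/T
    ("rs3", "Y", 3000, "A", "C"),      # excluded chromosome
    ("rs4", "2", 4000, "C", "T"),
]
kept = filter_markers(markers)
```

In this toy dataset only rs1 and rs4 survive the filters. Ambiguous markers are removed because complementing an A/T or G/C marker yields the same pair of alleles, so a strand mismatch cannot be detected from the alleles alone.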
Finally, IMPUTE2 is used to perform genome-wide imputation. Each chromosome is split into segments of 5 Mb (by default), in accordance with the segment size limitation of the IMPUTE2 program. For each segment, IMPUTE2 produces cross-validation metrics: a subset of genotyped loci is masked, imputed and compared to the expected values. The genipe package aggregates those statistics at the chromosomal and genomic levels using a weighted arithmetic mean. Results for all segments are merged to produce one result file per chromosome. Multiple companion files containing important information, such as minor allele frequency and information value (INFO), are also produced. The merged IMPUTE2 result files can optionally be compressed using BGZIP, a variant of GZIP that allows indexing. The pipeline can be run on a desktop computer or on a DRMAA-compliant high-performance computing server. The phasing and imputation steps are customizable to fit any study design. A database keeps track of all executed steps and allows the pipeline to be relaunched after a failure from where it last stopped, saving processing time and resources. At completion, a LaTeX report containing important imputation quality metrics and processing times is automatically created.
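
As a rough sketch of two of the steps above, the following Python code splits a chromosome into fixed-size segments and aggregates per-segment statistics with a weighted arithmetic mean. The function names are hypothetical, and the choice of weights (here, the number of masked genotypes per segment) is an assumption: the text only states that a weighted mean is used.

```python
def split_into_segments(chrom_length, segment_size=5_000_000):
    """Return inclusive, 1-based (start, end) intervals covering the chromosome."""
    return [
        (start, min(start + segment_size - 1, chrom_length))
        for start in range(1, chrom_length + 1, segment_size)
    ]


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(v * w) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# A hypothetical 12 Mb chromosome yields two full segments and a shorter one.
segments = split_into_segments(12_000_000)

# Hypothetical per-segment concordance rates, weighted by the (assumed)
# number of masked genotypes evaluated in each segment.
concordance = [0.98, 0.96, 0.90]
n_masked = [100, 300, 100]
chromosome_concordance = weighted_mean(concordance, n_masked)
```

With these made-up numbers the chromosome-level value is about 0.952, which is pulled toward the segment contributing the most masked genotypes, unlike the unweighted mean of about 0.947.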

2.2 Output file management

Given the high number of variants imputed using a genome-wide approach, file management becomes a computational challenge that needs to be addressed. To facilitate this task, we have developed the impute2-extractor utility that uses file indexing to accelerate extraction of variants from IMPUTE2 files by name or by genomic region. Allele frequency filtering as well as quality control thresholds based on the completion rate, the INFO field and the imputation probability can also be specified. This script can be used as a stand-alone utility and is automatically installed and added to the system path after setting up genipe.
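
To make the filtering criteria more concrete, here is a minimal Python sketch that derives a dosage, a completion rate and a minor allele frequency from one line of IMPUTE2 output (five descriptive columns followed by three genotype probabilities per sample). It does not reproduce impute2-extractor: the function name, the 0.9 probability threshold and the exact definitions of completion rate and frequency used here are assumptions for illustration.

```python
def summarize_impute2_line(line, prob_threshold=0.9):
    """Summarize one IMPUTE2 line: chrom, name, pos, a1, a2, then probability triplets."""
    fields = line.split()
    chrom, name, pos, a1, a2 = fields[:5]
    probs = [float(value) for value in fields[5:]]
    if not probs or len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    # One (P(a1/a1), P(a1/a2), P(a2/a2)) triplet per sample.
    triplets = [probs[i:i + 3] for i in range(0, len(probs), 3)]

    # Expected number of copies of the second allele for each sample.
    dosages = [p_ab + 2 * p_bb for _, p_ab, p_bb in triplets]

    # Assumption: a genotype counts as "called" when its best probability
    # reaches the threshold; the completion rate is the called fraction.
    n_called = sum(1 for triplet in triplets if max(triplet) >= prob_threshold)
    completion_rate = n_called / len(triplets)

    # Assumption: allele frequency is estimated from all dosages.
    a2_freq = sum(dosages) / (2 * len(dosages))
    maf = min(a2_freq, 1 - a2_freq)

    return {
        "name": name,
        "dosages": dosages,
        "completion_rate": completion_rate,
        "maf": maf,
    }


# Hypothetical marker with four samples.
line = "1 rs123 1000 A G 1 0 0 0 1 0 0.5 0.5 0 0 0 1"
summary = summarize_impute2_line(line)
```

For this example line, the third sample has no genotype reaching the threshold, so the completion rate is 0.75, and the four dosages (0, 1, 0.5 and 2) give a minor allele frequency of 0.4375. A filter would then compare such values to user-specified thresholds.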

2.3 Statistical analysis

Bindings for linear and logistic regression using the statsmodels package (Seabold ) are included for common variant association testing with continuous and discrete outcomes. Other, more sophisticated models, which were not previously available for dosage data in an efficient tool, now have bindings in genipe. These include the Cox proportional hazards model (lifelines package), for which no suitable implementation existed for pharmacogenomic GWAS (Syed et al.), an optimization (Sikorska et al.) of the linear mixed model for repeated measurements (statsmodels package, see Supplementary Figs S2 and S3), and the Sequence Kernel Association Test [SKAT R package (Ionita-Laza et al.)]. All these tests are available for dosage data through the companion script imputed-stats. See Supplementary Figure S1 for execution times for the different models.
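
As a simplified illustration of what testing a continuous outcome against dosage data involves, the sketch below fits a one-predictor ordinary least squares model in plain Python. It is not how imputed-stats is implemented (genipe relies on statsmodels and supports covariates); the function name and the example values are hypothetical.

```python
import math


def ols_dosage_effect(dosages, phenotypes):
    """Fit phenotype = intercept + beta * dosage and return (beta, intercept, se).

    Closed-form simple linear regression, without covariates.
    """
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least three paired observations")

    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("dosage has no variance (monomorphic marker)")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))

    beta = sxy / sxx
    intercept = mean_y - beta * mean_x

    # Standard error of the slope from the residual variance (n - 2 df).
    residual_ss = sum(
        (y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes)
    )
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    return beta, intercept, se


# Hypothetical dosages (0 to 2 copies of the coded allele) and phenotypes.
dosages = [0.0, 0.5, 1.0, 1.5, 2.0]
phenotypes = [1.0, 2.1, 2.9, 4.2, 5.0]
beta, intercept, se = ols_dosage_effect(dosages, phenotypes)
```

Because dosages are continuous values between 0 and 2 rather than hard genotype calls, the estimated effect per allele copy (about 2.02 in this toy example) takes the imputation uncertainty of each sample into account.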

3 Application

The pipeline’s documentation provides a typical imputation analysis tutorial along with the required files. Those files include a dataset of 2 278 357 markers genotyped on 90 HapMap samples. Using the 1000 Genomes Phase 3 reference panels, the dataset was imputed on two different systems: a computing server (10 nodes of 8 Intel® Xeon® E5620 CPUs at 2.40 GHz, 48 GB of RAM per node) using the DRMAA API for automatic task submission, and a desktop computer (Intel® Core™ i7-3770 CPU at 3.40 GHz, 16 GB of RAM). Using a maximum of 50 simultaneous tasks, the pipeline took a total of 4.25 h on the computing server (including a waiting period of 0.08 h in queue). Using a maximum of four simultaneous tasks, the pipeline took 10.62 h to complete on the desktop computer.

4 Conclusion

Although online imputation pipelines exist (e.g. the Michigan Imputation Server, which uses Minimac3, https://imputationserver.sph.umich.edu/, and the Sanger Imputation Service, which uses PBWT, https://imputation.sanger.ac.uk/), genipe is advantageous for users who cannot upload genotypic data to an off-site server because of ethical or legal restrictions. Also, as public servers gain in popularity, their high workload can add significant queue time to the imputation analysis. The genipe pipeline can be efficiently executed on a local high-performance computing server or on a single desktop computer. Finally, genipe provides a unified interface to statistical analysis packages that did not have existing tools to automate the use of dosage data (e.g. linear mixed models from statsmodels, Cox proportional hazards from lifelines and SKAT).

Funding

This work was supported by the Montreal Heart Institute Foundation, Genome Canada and Genome Quebec.

Conflict of Interest: none declared.

References

1. A linear complexity phasing method for thousands of genomes.

Authors: Olivier Delaneau; Jonathan Marchini; Jean-François Zagury
Journal: Nat Methods Date: 2011-12-04 Impact factor: 28.547

2. Genotype imputation for genome-wide association studies.

Authors: Jonathan Marchini; Bryan Howie
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

3. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

4. GWAS with longitudinal phenotypes: performance of approximate procedures.

Authors: Karolina Sikorska; Nahid Mostafavi Montazeri; André Uitterlinden; Fernando Rivadeneira; Paul Hc Eilers; Emmanuel Lesaffre
Journal: Eur J Hum Genet Date: 2015-02-25 Impact factor: 4.246

5. Sequence kernel association tests for the combined effect of rare and common variants.

Authors: Iuliana Ionita-Laza; Seunggeun Lee; Vlad Makarov; Joseph D Buxbaum; Xihong Lin
Journal: Am J Hum Genet Date: 2013-05-16 Impact factor: 11.025

6. Evaluation of methodology for the analysis of 'time-to-event' data in pharmacogenomic genome-wide association studies.

Authors: Hamzah Syed; Andrea L Jorgensen; Andrew P Morris
Journal: Pharmacogenomics Date: 2016-06-01 Impact factor: 2.533

7. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Authors: Bryan N Howie; Peter Donnelly; Jonathan Marchini
Journal: PLoS Genet Date: 2009-06-19 Impact factor: 5.917
