Literature DB >> 32803703

GEN2VCF: a converter for human genome imputation output format to VCF format.

Dong Mun Shin^1,2, Mi Yeong Hwang¹, Bong-Jo Kim¹, Keun Ho Ryu^3,4, Young Jin Kim⁵.

Abstract

BACKGROUND: For a genome-wide association study in humans, genotype imputation is an essential analysis tool for improving association mapping power. When IMPUTE software is used for imputation analysis, an imputation output (GEN format) should be converted to variant call format (VCF) with imputed genotype dosage for association analysis. However, the conversion requires multiple software packages in a pipeline with a large amount of processing time.
OBJECTIVE: We developed GEN2VCF, a fast and convenient GEN format to VCF conversion tool with dosage support.
METHODS: The performance of GEN2VCF was compared to BCFtools, QCTOOL, and Oncofunco. The test data set was a 1 Mb GEN-formatted file of 5000 samples. To determine the performance of various sample sizes, tests were performed from 1000 to 5000 samples with a step size of 1000. Runtime and memory usage were used as performance measures.
RESULTS: GEN2VCF showed drastically increased performances with respect to runtime and memory usage. Runtime and memory usage of GEN2VCF was at least 1.4- and 7.4-fold lower compared to other methods, respectively.
CONCLUSIONS: GEN2VCF provides users with efficient conversion from GEN format to VCF with the best-guessed genotype, genotype posterior probabilities, and genotype dosage, as well as great flexibility in implementation with other software packages in a pipeline.

Entities: Chemical Disease Gene Species

Keywords: Converter; Human genome; Imputation; Parsing; SNP

Year: 2020 PMID： 32803703 PMCID： PMC7497724 DOI： 10.1007/s13258-020-00982-0

Source DB: PubMed Journal: Genes Genomics ISSN： 1976-9571 Impact factor: 1.839

Introduction

A genome-wide association study (GWAS) is a well-known approach to identify genetic variations associated with complex traits (Visscher et al. 2012). The GWAS Catalog is a free online database that collects GWAS results. As of November 2019, the catalog contains 161,525 variant-trait associations from 4298 publications (https://www.ebi.ac.uk/gwas/) (Buniello et al. 2019). In a GWAS, genotype imputation has been regarded as an essential analysis tool to improve the power of association mapping by estimating tens of millions of variants that are not directly genotyped using a single nucleotide polymorphism (SNP) microarray. Genotype imputation infers missing or untyped SNPs in a study dataset from a reference panel, such as the 1000 Genomes project and Haplotype Reference Consortium (Auton et al. 2015; Huang et al. 2009; McCarthy et al. 2016). Various imputation tools have been introduced such as IMPUTE2 (Howie et al. 2009), BEAGLE (Browning and Browning 2016), Mach (Li et al. 2010), and Minimac (Howie et al. 2012). By default, imputation estimates posterior probabilities of three genotypes AA, AB, and BB. These posterior probabilities are often used in a form of three different types in association testing: the best-guessed genotype (GT) with maximum posterior probability; genotype probabilities (GPs); and genotype dosage (DS), which is the posterior mean of three posterior probabilities. Among them, DS is widely used in testing associations for imputed genotypes. The association test using DS showed enhanced statistical power (Liu et al. 2013). However, there are challenges in using imputed dosages in association tests. Dedicated software packages, such as SNPTEST (see URLs) and mach2qtl (see URLs), using imputed dosages in association testing does not support various statistical methods and gene-based tests supported by recent association software packages, such as EPACTS (see URLs) and RAREMETAL (Feng et al. 2014). EPACTS and RAREMETAL are used to perform various statistical analyses and gene-based association tests using variant call format (VCF), which contains formatted imputed genotypes. Although the recently developed Minimac 3 outputs imputation data in a VCF file, IMPUTE only outputs GEN files, a non-VCF file (Howie et al. 2012). Even though IMPUTE does not support VCF, IMPUTE has been widely used in many GWASs due to its high imputation accuracy comparable to Minimac (Das et al. 2016). To handle imputed data from IMPUTE, an additional conversion process is required for subsequent association analyses. Existing tools that support a VCF conversion process, such as BCFtools (see URLs) and QCTOOL (see URLs), convert IMPUTE GEN files to VCF without dosage information. Thus, additional data processing using VCF parsers, such as PySAM (see URLs), is required to obtain dosage information, and the output can be merged with VCF data from BCFtools and QCTOOL. Oncofunco is an R package (see URLs) that converts posterior probabilities in an IMPUTE2 gen file to dosage and then outputs to a VCF file. The VCF file contains only dosage information; therefore, other information is added using the VCF parser. These multiple conversion steps may take a lot of time for reading, modifying, and writing data. Currently, as far as we know, Hail (see URLs) is the only software package that can be used for converting GEN files to VCF files. Hail uses Spark to read and write large data sets (Ganna et al. 2016; Khera et al. 2018). However, the implementation of a Spark-based system environment requires experts in related fields and a supercomputing resource for handling a large-scale dataset. Therefore, a fast and convenient GEN format to VCF conversion tool with DS support is warranted. In this paper, we present a new tool GEN2VCF, which converts the IMPUTE output in GEN format to VCF. GEN2VCF provides DS as well as GT and GP. GEN2VCF is a C-based software that converts GEN files faster than the existing pipelines and is efficient in handling large amounts of data with low memory usage. GEN2VCF also has options for standard input and output of processing data. This feature is particularly useful in implementing GEN2VCF with various different software packages by piping and redirection. We compared the performance of GEN2VCF with three possible pipelines by using combinations of three converting tools (BCFtools, QCTOOL, and Oncofunco) and a VCF parser (PySAM). A subset of chromosome 1 of the imputed data of 5000 samples was used as input data. To measure the performance, runtime and memory usage were used as measures.

Materials and methods

Implementation of GEN2VCF

GEN2VCF was implemented in the C programming language on Linux-based operating systems, which allows for large amounts of imputed data to be handled quickly. Memory usage is also relatively low compared to other programming languages (Fourment and Gillings 2008). All GEN2VCF commands are run in a Linux terminal. Given two alleles of A, B, there are three possible genotypes of a SNP: AA, AB, and BB. The A allele was regarded as the reference allele, and B allele as a coded allele (alternative allele). From the imputation output, the probability of each genotype is given by P(AA), P(AB), and P(BB). An imputed genotype dosage was estimated as 0 · P(AA) + 1 · P(AB) + 2 · (BB) (Hoffmann and Witte 2015). The dosage has a value between 0 and 2.

Comparison with other existing software packages

For the comparison analysis, we converted a GEN-formatted file, which is an output from IMPUTE software, to a VCF file with GT, GP, and DS. In the conversion from GEN format to VCF, the processes of GEN2VCF and existing software packages (BCFtools, QCTOOL, and Oncofunco) were displayed in Fig. 1. Briefly, there are three main steps during conversion processes: (1) the GEN file generated by IMPUTE is read, (2) dosages are calculated using genotype probabilities in the GEN file, and (3) an indexed compressed (bgzip) VCF file with GT, GP, and DS is generated. The basic characteristics of GEN2VCF and existing software packages are summarized in Table 1. Since the existing software alone do not have an option for handling dosage values for the conversion, an imputed genotype dosage was calculated using the VCF parser PySAM. On the other hand, GEN2VCF provides the conversion in a single process, thereby enabling more efficient analysis.

Fig. 1

Conversion processes of GEN2VCF and existing software packages

Table 1

Basic characteristics of methods used in this study

Tool	Input	Output	VCF FORMAT Field	Ver.	Ref.	URL
BCFtools	GEN, BCF, VCF, HAPS	BCF, VCF	GT, PL, GL, GQ, GP	1.6	PMID: 21,903,627	http://samtools.github.io/bcftools/bcftools.html
QCTOOL	GEN, BGEN, VCF	GEN, VCF	GT, GP	1.4	doi: 10.1101/308296	http://www.well.ox.ac.uk/~gav/qctool/#overview
GEN2VCF	GEN, BGEN	VCF	GT, GP, DS	1.0	–	https://bitbucket.org/4shin/division-of-genome-research/src/master
Oncofunco	GEN	VCF	DS	–	–	https://github.com/oncogenetics/oncofunco

Conversion processes of GEN2VCF and existing software packages Basic characteristics of methods used in this study

Performance test

For the experiment, we randomly sampled imputed data from a 1 Mb region on chromosome 1 from 5000 samples that was previously genotyped with the Korea Biobank Array (Moon et al. 2019). The 1 Mb genotype data were pre-phased using Eagle v2.3 (Loh et al. 2016) and imputed using Impute v4 (Bycroft et al. 2018) using the 1000 Genomes project phase 3 data as a reference panel (Auton et al. 2015). The imputed dataset consists of 13,891 variants. All experiments were performed on a computer with an Intel Xeon processors 3.47 GHz (12 cores), 66 GB of memory, and the Linux-based operating system Ubuntu 14.04.6. To measure the performances of GEN2VCF and other software packages, we used total runtime and maximum memory usage as performance measures. All tools were used with their default options in a single process.

Results

We performed a comparison analysis between GEN2VCF and possible three existing pipelines by using combinations of three converting tools (BCFtools, QCTOOL, and Oncofunco) and a VCF parser (PySAM). We converted a GEN-formatted file, which was an output from the IMPUTE software, to a VCF file with GT, GP, and DS. To determine the performance for various sample sizes, tests were performed from 1000 to 5000 samples with a step size of 1000. To determine the performance, total runtime and memory usage was used for each approach. The basic characteristics of the four methods used in this study are summarized in Table 1. BCFtools and QCTOOL only support the GT and GP of each genotype. Oncofunco outputs a VCF file with DS except GT and GP. Therefore, the VCF parser PySAM was used to combine VCF files with partial information to generate a VCF file with GT, GP, and DS. Figure 2 shows the total runtime of each method. As shown in the figure, GEN2VCF was the fastest among the four methods. The second fastest pipeline was Oncofunco and BCFtools used with PySAM. The runtime for generating a VCF file using QCTOOL and PySAM was the lowest of the four. However, GEN2VCF showed a 1.4–17-fold decrease in conversion time compared to the other pipelines.

Fig. 2

Runtime comparison among the four methods

Runtime comparison among the four methods In terms of memory usage during the conversion process, GEN2VCF had the least memory usage among the methods (Fig. 3). Oncofunco and Pysam use more memory than GEN2VCF to generate the VCF file. When using BCFtools and QCTOOL with PySAM, memory usage was comparable to other methods. For the conversion process, as the sample size increased, the difference in memory usage of other methods increased compared to that of GEN2VCF. When a 1 Mb GEN file with 5000 samples was used as the input, GEN2VCF showed a 7.4–1770-fold decrease in memory usage compared to other methods.

Fig. 3

Memory usage comparison among four methods

Discussion

In this study, we developed a new tool to convert the IMPUTE output (GEN format) to VCF with GT, GP, and DS in a single process. The performance of GEN2VCF was compared with three possible pipelines using existing tools. As a result, GEN2VCF showed at least a 1.4-fold decrease in processing time during the conversion. Moreover, GEN2VCF showed the lowest memory usage; at least a 7.4-fold decrease in memory usage was observed when converting a 1 Mb GEN file of 5000 samples. The difference in memory usage was greater by increasing the number of samples for conversion. The memory usage is very important in cases handling millions of samples of whole genome imputed genotypes using a parallel computing environment. Since the maximum memory of a node of a parallel computing environment is limited, large memory usage may produce inefficiencies in the use of computing power for converting GEN files. The increased performance of GEN2VCF was achieved by programming a dedicated conversion software using a high-level C language, minimizing memory usage by processing GEN file line by line appending to a temporary buffer, fast conversion of floating point to string via a custom function. Our results showed that GEN2VCF is an efficient and convenient tool for converting a GEN file to a VCF file with GT, GP, and DS. In addition to the more efficient performance, GEN2VCF provides users with a convenient option of standard input and output for data processing. This feature is particularly useful in implementing GEN2VCF with various different software packages by piping and redirection. For example, an association test can be performed in a single command line by piping a GEN file management tool (i.e., QCTOOL), GEN2VCF, and association software supporting the VCF. Also, the application can be more efficient in managing storage space if used with a compressed imputation output. Imputed genotype data of millions of samples are typically hundreds of terabytes. For example, the BGEN format can significantly save storage space because it has a smaller file size than files with GEN format (Band and Marchini 2018; Bycroft et al. 2018). Indeed, about half a million samples of whole genome imputation data in the UK Biobank required about 2.1 Tb of file space (Bycroft et al. 2018). In a pipelined command, GEN2VCF can handle a standard output from QCTOOL (which converts BGEN files to GEN files), convert GEN format to VCF with GT, GP, and DS, and then the VCF data can also be redirected to other software packages. In conclusion, GEN2VCF provides users not only efficient conversion from GEN format to VCF with GT, GP, and DS, but also great flexibility in implementation with other software packages in a pipelined command.

19 in total

Review 1. Five years of GWAS discovery.

Authors: Peter M Visscher; Matthew A Brown; Mark I McCarthy; Jian Yang
Journal: Am J Hum Genet Date: 2012-01-13 Impact factor: 11.025

2. Genotype Imputation with Millions of Reference Samples.

Authors: Brian L Browning; Sharon R Browning
Journal: Am J Hum Genet Date: 2016-01-07 Impact factor: 11.025

3. Optimal methods for using posterior probabilities in association testing.

Authors: Keli Liu; Alexander Luedtke; Nathan Tintle
Journal: Hum Hered Date: 2013-03-27 Impact factor: 0.444

4. Fast and accurate long-range phasing in a UK Biobank cohort.

Authors: Po-Ru Loh; Pier Francesco Palamara; Alkes L Price
Journal: Nat Genet Date: 2016-06-06 Impact factor: 38.330

5. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

6. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019.

Authors: Annalisa Buniello; Jacqueline A L MacArthur; Maria Cerezo; Laura W Harris; James Hayhurst; Cinzia Malangone; Aoife McMahon; Joannella Morales; Edward Mountjoy; Elliot Sollis; Daniel Suveges; Olga Vrousgou; Patricia L Whetzel; Ridwan Amode; Jose A Guillen; Harpreet S Riat; Stephen J Trevanion; Peggy Hall; Heather Junkins; Paul Flicek; Tony Burdett; Lucia A Hindorff; Fiona Cunningham; Helen Parkinson
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

7. A reference panel of 64,976 haplotypes for genotype imputation.

Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin
Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330

8. Ultra-rare disruptive and damaging mutations influence educational attainment in the general population.

Authors: Andrea Ganna; Giulio Genovese; Daniel P Howrigan; Andrea Byrnes; Mitja Kurki; Seyedeh M Zekavat; Christopher W Whelan; Mart Kals; Michel G Nivard; Alex Bloemendal; Jonathan M Bloom; Jacqueline I Goldstein; Timothy Poterba; Cotton Seed; Robert E Handsaker; Pradeep Natarajan; Reedik Mägi; Diane Gage; Elise B Robinson; Andres Metspalu; Veikko Salomaa; Jaana Suvisaari; Shaun M Purcell; Pamela Sklar; Sekar Kathiresan; Mark J Daly; Steven A McCarroll; Patrick F Sullivan; Aarno Palotie; Tõnu Esko; Christina Hultman; Benjamin M Neale
Journal: Nat Neurosci Date: 2016-10-03 Impact factor: 24.884

9. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations.

Authors: Amit V Khera; Mark Chaffin; Krishna G Aragam; Mary E Haas; Carolina Roselli; Seung Hoan Choi; Pradeep Natarajan; Eric S Lander; Steven A Lubitz; Patrick T Ellinor; Sekar Kathiresan
Journal: Nat Genet Date: 2018-08-13 Impact factor: 38.330

10. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962