Dong Mun Shin1,2, Mi Yeong Hwang1, Bong-Jo Kim1, Keun Ho Ryu3,4, Young Jin Kim5. 1. Division of Genome Research, Center for Genome Science, National Institute of Health, Osong Health Technology Administration Complex, 187, Osongsaengmyeong 2-ro, Osong-eup, Heungdeok-gu, Cheongju-si, Chungcheongbuk-do, 28159, Republic of Korea. 2. Database and Bioinformatics Laboratory, Department of Computer Science, College of Electrical and Computer Engineering, Chungbuk National University, 28644, Cheongju, Republic of Korea. 3. Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, 700000, Vietnam. khryu@tdtu.edu.vn. 4. Database and Bioinformatics Laboratory, Department of Computer Science, College of Electrical and Computer Engineering, Chungbuk National University, 28644, Cheongju, Republic of Korea. khryu@tdtu.edu.vn. 5. Division of Genome Research, Center for Genome Science, National Institute of Health, Osong Health Technology Administration Complex, 187, Osongsaengmyeong 2-ro, Osong-eup, Heungdeok-gu, Cheongju-si, Chungcheongbuk-do, 28159, Republic of Korea. inthistime@korea.kr.
Abstract
BACKGROUND: For a genome-wide association study in humans, genotype imputation is an essential analysis tool for improving association mapping power. When IMPUTE software is used for imputation analysis, the imputation output (GEN format) should be converted to variant call format (VCF) with imputed genotype dosage for association analysis. However, the conversion requires multiple software packages in a pipeline and a large amount of processing time. OBJECTIVE: We developed GEN2VCF, a fast and convenient GEN format to VCF conversion tool with dosage support. METHODS: The performance of GEN2VCF was compared to BCFtools, QCTOOL, and Oncofunco. The test data set was a 1 Mb GEN-formatted file of 5000 samples. To determine the performance for various sample sizes, tests were performed from 1000 to 5000 samples with a step size of 1000. Runtime and memory usage were used as performance measures. RESULTS: GEN2VCF showed markedly better performance with respect to runtime and memory usage. The runtime and memory usage of GEN2VCF were at least 1.4- and 7.4-fold lower, respectively, than those of the other methods. CONCLUSIONS: GEN2VCF provides users with efficient conversion from GEN format to VCF with the best-guessed genotype, genotype posterior probabilities, and genotype dosage, as well as great flexibility in implementation with other software packages in a pipeline.
Keywords:
Converter; Human genome; Imputation; Parsing; SNP
A genome-wide association study (GWAS) is a well-known approach to identify genetic variations associated with complex traits (Visscher et al. 2012). The GWAS Catalog is a free online database that collects GWAS results. As of November 2019, the catalog contains 161,525 variant-trait associations from 4298 publications (https://www.ebi.ac.uk/gwas/) (Buniello et al. 2019). In a GWAS, genotype imputation has been regarded as an essential analysis tool to improve the power of association mapping by estimating tens of millions of variants that are not directly genotyped using a single nucleotide polymorphism (SNP) microarray. Genotype imputation infers missing or untyped SNPs in a study dataset from a reference panel, such as the 1000 Genomes project and Haplotype Reference Consortium (Auton et al. 2015; Huang et al. 2009; McCarthy et al. 2016). Various imputation tools have been introduced, such as IMPUTE2 (Howie et al. 2009), BEAGLE (Browning and Browning 2016), Mach (Li et al. 2010), and Minimac (Howie et al. 2012).
By default, imputation estimates the posterior probabilities of the three genotypes AA, AB, and BB. In association testing, these posterior probabilities are typically used in one of three forms: the best-guessed genotype (GT), which is the genotype with the maximum posterior probability; the genotype probabilities (GPs) themselves; and the genotype dosage (DS), which is the expected count of the coded allele under the three posterior probabilities. Among them, DS is widely used in testing associations for imputed genotypes. The association test using DS showed enhanced statistical power (Liu et al. 2013).
However, there are challenges in using imputed dosages in association tests. Dedicated software packages that use imputed dosages in association testing, such as SNPTEST (see URLs) and mach2qtl (see URLs), do not support the various statistical methods and gene-based tests supported by recent association software packages, such as EPACTS (see URLs) and RAREMETAL (Feng et al. 2014). EPACTS and RAREMETAL perform various statistical analyses and gene-based association tests using variant call format (VCF) files that contain formatted imputed genotypes. Although the recently developed Minimac 3 outputs imputation data in a VCF file, IMPUTE outputs only GEN files, which are not in VCF (Howie et al. 2012). Even though IMPUTE does not support VCF, it has been widely used in many GWASs because its imputation accuracy is high and comparable to that of Minimac (Das et al. 2016). To handle imputed data from IMPUTE, an additional conversion process is required for subsequent association analyses.
Existing tools that support a VCF conversion process, such as BCFtools (see URLs) and QCTOOL (see URLs), convert IMPUTE GEN files to VCF without dosage information. Thus, additional data processing using VCF parsers, such as PySAM (see URLs), is required to obtain dosage information, and the output can be merged with VCF data from BCFtools and QCTOOL. Oncofunco is an R package (see URLs) that converts posterior probabilities in an IMPUTE2 GEN file to dosage and then outputs a VCF file. That VCF file contains only dosage information; therefore, the other information must be added using the VCF parser. These multiple conversion steps may take a lot of time for reading, modifying, and writing data. Currently, as far as we know, Hail (see URLs) is the only software package that can be used for converting GEN files to VCF files. Hail uses Spark to read and write large data sets (Ganna et al. 2016; Khera et al. 2018). However, the implementation of a Spark-based system environment requires experts in related fields and a supercomputing resource for handling a large-scale dataset. Therefore, a fast and convenient GEN format to VCF conversion tool with DS support is warranted.
In this paper, we present a new tool, GEN2VCF, which converts the IMPUTE output in GEN format to VCF. GEN2VCF provides DS as well as GT and GP. GEN2VCF is C-based software that converts GEN files faster than the existing pipelines and handles large amounts of data efficiently with low memory usage. GEN2VCF also has options for reading from standard input and writing to standard output. This feature is particularly useful for combining GEN2VCF with other software packages by piping and redirection. We compared the performance of GEN2VCF with three possible pipelines built from combinations of three converting tools (BCFtools, QCTOOL, and Oncofunco) and a VCF parser (PySAM). A subset of chromosome 1 of the imputed data of 5000 samples was used as input data. Runtime and memory usage were used as performance measures.
Materials and methods
Implementation of GEN2VCF
GEN2VCF was implemented in the C programming language on Linux-based operating systems, which allows large amounts of imputed data to be handled quickly. Memory usage is also relatively low compared with implementations in other programming languages (Fourment and Gillings 2008). All GEN2VCF commands are run in a Linux terminal. Given two alleles, A and B, there are three possible genotypes of a SNP: AA, AB, and BB. The A allele was regarded as the reference allele, and the B allele as the coded allele (alternative allele). From the imputation output, the probability of each genotype is given by P(AA), P(AB), and P(BB). An imputed genotype dosage was estimated as 0 · P(AA) + 1 · P(AB) + 2 · P(BB) (Hoffmann and Witte 2015). The dosage has a value between 0 and 2.
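The dosage formula above can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not GEN2VCF's C code; the function name `genotype_dosage` and the input check are assumptions made for this example.

```python
def genotype_dosage(p_aa, p_ab, p_bb):
    """Expected count of the coded (B) allele: 0*P(AA) + 1*P(AB) + 2*P(BB)."""
    for p in (p_aa, p_ab, p_bb):
        # Assumption for this sketch: each input is a probability in [0, 1].
        if not 0.0 <= p <= 1.0:
            raise ValueError("genotype probabilities must lie in [0, 1]")
    return 0.0 * p_aa + 1.0 * p_ab + 2.0 * p_bb


if __name__ == "__main__":
    # P(AA) = 0, P(AB) = 0.25, P(BB) = 0.75 gives 0.25 + 1.5 = 1.75
    print(genotype_dosage(0.0, 0.25, 0.75))
```

A sample that is certainly AA gets a dosage of 0 and a sample that is certainly BB gets a dosage of 2, which matches the stated range.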
Comparison with other existing software packages
For the comparison analysis, we converted a GEN-formatted file, which is an output from IMPUTE software, to a VCF file with GT, GP, and DS. The processes used by GEN2VCF and the existing software packages (BCFtools, QCTOOL, and Oncofunco) to convert GEN format to VCF are shown in Fig. 1. Briefly, there are three main steps in the conversion: (1) the GEN file generated by IMPUTE is read, (2) dosages are calculated using the genotype probabilities in the GEN file, and (3) an indexed, compressed (bgzip) VCF file with GT, GP, and DS is generated. The basic characteristics of GEN2VCF and the existing software packages are summarized in Table 1. Since the existing software packages alone do not have an option for handling dosage values during the conversion, the imputed genotype dosage was calculated using the VCF parser PySAM. In contrast, GEN2VCF performs the conversion in a single process, thereby enabling more efficient analysis.
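Steps (1) and (2) and the per-variant part of step (3) can be sketched in Python as follows. This is a hedged illustration, not the GEN2VCF implementation. It assumes the common IMPUTE2 GEN layout of five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three probabilities per sample; the function name, the `chrom` argument, the unphased `0/1` notation, three-decimal formatting, and the use of `./.` when all three probabilities are zero are all choices made for this example.

```python
def gen_line_to_vcf_record(line, chrom):
    """Turn one GEN line into one tab-separated VCF data line with GT:GP:DS."""
    fields = line.split()
    # Assumed GEN layout: SNP ID, rsID, position, allele A, allele B, then probabilities.
    _snp_id, rs_id, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    sample_cols = []
    for i in range(0, len(probs), 3):
        trio = tuple(probs[i:i + 3])  # P(AA), P(AB), P(BB)
        if max(trio) == 0.0:
            gt = "./."  # assumption: all-zero probabilities mean a missing call
        else:
            gt = ("0/0", "0/1", "1/1")[trio.index(max(trio))]
        ds = trio[1] + 2.0 * trio[2]  # 0*P(AA) + 1*P(AB) + 2*P(BB)
        gp = ",".join(f"{p:.3f}" for p in trio)
        sample_cols.append(f"{gt}:{gp}:{ds:.3f}")

    # QUAL, FILTER and INFO are left as "." in this sketch.
    fixed = [chrom, pos, rs_id, allele_a, allele_b, ".", ".", ".", "GT:GP:DS"]
    return "\t".join(fixed + sample_cols)


if __name__ == "__main__":
    print(gen_line_to_vcf_record("snp1 rs123 1000 A G 0 0.25 0.75 1 0 0", "1"))
```

A complete VCF file would also need the `##fileformat` and `##FORMAT` header lines and the `#CHROM` column line, followed by bgzip compression and indexing, none of which is shown here.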
Fig. 1
Conversion processes of GEN2VCF and existing software packages
Table 1
Basic characteristics of methods used in this study
Performance test
For the experiment, we randomly sampled imputed data from a 1 Mb region on chromosome 1 from 5000 samples that were previously genotyped with the Korea Biobank Array (Moon et al. 2019). The 1 Mb genotype data were pre-phased using Eagle v2.3 (Loh et al. 2016) and imputed using Impute v4 (Bycroft et al. 2018) with the 1000 Genomes project phase 3 data as a reference panel (Auton et al. 2015). The imputed dataset consists of 13,891 variants. All experiments were performed on a computer with a 3.47 GHz Intel Xeon processor (12 cores), 66 GB of memory, and the Linux-based operating system Ubuntu 14.04.6. To measure the performance of GEN2VCF and the other software packages, we used total runtime and maximum memory usage. All tools were used with their default options in a single process.
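The text does not say how runtime and maximum memory were recorded. One possible way to collect both measures for a command on Linux is sketched below with the Python standard library; the function name `measure` is hypothetical, and this is not necessarily the procedure the authors used.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd to completion and return (wall-clock seconds, peak RSS in kB)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # On Linux, ru_maxrss is reported in kilobytes. RUSAGE_CHILDREN gives the
    # largest peak among all child processes waited for so far, so each tool
    # should be measured from a fresh Python process to keep runs separate.
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kb


if __name__ == "__main__":
    # Placeholder command; a real benchmark would pass the converter's command line.
    seconds, peak = measure([sys.executable, "-c", "pass"])
    print(f"{seconds:.3f} s, {peak} kB")
```

For a multi-step pipeline, each step would need to be measured this way and the runtimes summed, while the memory figure of interest is the largest peak across steps.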
Results
We performed a comparison analysis between GEN2VCF and three possible existing pipelines built from combinations of three converting tools (BCFtools, QCTOOL, and Oncofunco) and a VCF parser (PySAM). We converted a GEN-formatted file, which was an output from the IMPUTE software, to a VCF file with GT, GP, and DS. To determine the performance for various sample sizes, tests were performed from 1000 to 5000 samples with a step size of 1000. Total runtime and memory usage were recorded for each approach.
The basic characteristics of the four methods used in this study are summarized in Table 1. BCFtools and QCTOOL only support the GT and GP of each genotype. Oncofunco outputs a VCF file with DS but without GT and GP. Therefore, the VCF parser PySAM was used to combine VCF files with partial information to generate a VCF file with GT, GP, and DS.
Figure 2 shows the total runtime of each method. As shown in the figure, GEN2VCF was the fastest among the four methods. The second fastest pipeline was Oncofunco and BCFtools used with PySAM. The pipeline using QCTOOL and PySAM had the longest runtime of the four. Overall, GEN2VCF showed a 1.4–17-fold decrease in conversion time compared to the other pipelines.
Fig. 2
Runtime comparison among the four methods
In terms of memory usage during the conversion process, GEN2VCF had the lowest memory usage among the methods (Fig. 3). Oncofunco and PySAM used more memory than GEN2VCF to generate the VCF file. When using BCFtools and QCTOOL with PySAM, memory usage was comparable to other methods. As the sample size increased, the gap in memory usage between GEN2VCF and the other methods widened. When a 1 Mb GEN file with 5000 samples was used as the input, GEN2VCF showed a 7.4–1770-fold decrease in memory usage compared to the other methods.
Fig. 3
Memory usage comparison among four methods
Discussion
In this study, we developed a new tool to convert the IMPUTE output (GEN format) to VCF with GT, GP, and DS in a single process. The performance of GEN2VCF was compared with three possible pipelines using existing tools. GEN2VCF showed at least a 1.4-fold decrease in processing time during the conversion. Moreover, GEN2VCF showed the lowest memory usage; at least a 7.4-fold decrease in memory usage was observed when converting a 1 Mb GEN file of 5000 samples. The difference in memory usage grew as the number of samples increased. Memory usage is very important when handling whole-genome imputed genotypes of millions of samples in a parallel computing environment. Since the maximum memory of a node in such an environment is limited, large memory usage may lead to inefficient use of computing power when converting GEN files. The increased performance of GEN2VCF was achieved by writing dedicated conversion software in C, by minimizing memory usage through processing the GEN file line by line and appending to a temporary buffer, and by converting floating-point values to strings quickly via a custom function. Our results showed that GEN2VCF is an efficient and convenient tool for converting a GEN file to a VCF file with GT, GP, and DS.
In addition to the more efficient performance, GEN2VCF provides users with a convenient option of standard input and output for data processing. This feature is particularly useful for combining GEN2VCF with other software packages by piping and redirection. For example, an association test can be performed in a single command line by piping together a GEN file management tool (e.g., QCTOOL), GEN2VCF, and association software that supports VCF. The application can also manage storage space more efficiently if used with a compressed imputation output. Imputed genotype data of millions of samples are typically hundreds of terabytes. For example, the BGEN format can significantly save storage space because it has a smaller file size than files in GEN format (Band and Marchini 2018; Bycroft et al. 2018). Indeed, about half a million samples of whole genome imputation data in the UK Biobank required about 2.1 TB of file space (Bycroft et al. 2018). In a pipelined command, GEN2VCF can take the standard output from QCTOOL (which converts BGEN files to GEN files) and convert GEN format to VCF with GT, GP, and DS, and the VCF data can then be redirected to other software packages.
In conclusion, GEN2VCF provides users with not only efficient conversion from GEN format to VCF with GT, GP, and DS, but also great flexibility in implementation with other software packages in a pipelined command.
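The line-by-line processing with a temporary buffer described in the Discussion can be sketched as a small stream filter. The Python below is a stand-in for the idea only, not GEN2VCF's C code; `stream_convert`, `convert_line`, and `buffer_limit` are hypothetical names, and the per-line conversion is passed in as a function so the sketch stays independent of any file format.

```python
import io


def stream_convert(src, dst, convert_line, buffer_limit=1 << 16):
    """Convert src to dst one line at a time, holding only a bounded buffer."""
    buffer = []
    buffered_chars = 0
    for line in src:
        out = convert_line(line.rstrip("\n"))
        buffer.append(out)
        buffered_chars += len(out) + 1  # +1 for the newline added on write
        if buffered_chars >= buffer_limit:
            # Flush once the buffer is full so memory use stays roughly constant.
            dst.write("\n".join(buffer) + "\n")
            buffer.clear()
            buffered_chars = 0
    if buffer:
        dst.write("\n".join(buffer) + "\n")


if __name__ == "__main__":
    # Toy demonstration with in-memory streams and an upper-casing "converter".
    source = io.StringIO("a\nb\nc\n")
    target = io.StringIO()
    stream_convert(source, target, str.upper, buffer_limit=3)
    print(target.getvalue(), end="")
```

Because the function only needs file-like objects, passing `sys.stdin` and `sys.stdout` would let such a filter sit in the middle of a shell pipeline, which is the usage pattern the text describes for GEN2VCF.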
Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis Journal: Nature Date: 2015-10-01 Impact factor: 49.962
Authors: Annalisa Buniello; Jacqueline A L MacArthur; Maria Cerezo; Laura W Harris; James Hayhurst; Cinzia Malangone; Aoife McMahon; Joannella Morales; Edward Mountjoy; Elliot Sollis; Daniel Suveges; Olga Vrousgou; Patricia L Whetzel; Ridwan Amode; Jose A Guillen; Harpreet S Riat; Stephen J Trevanion; Peggy Hall; Heather Junkins; Paul Flicek; Tony Burdett; Lucia A Hindorff; Fiona Cunningham; Helen Parkinson Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330
Authors: Andrea Ganna; Giulio Genovese; Daniel P Howrigan; Andrea Byrnes; Mitja Kurki; Seyedeh M Zekavat; Christopher W Whelan; Mart Kals; Michel G Nivard; Alex Bloemendal; Jonathan M Bloom; Jacqueline I Goldstein; Timothy Poterba; Cotton Seed; Robert E Handsaker; Pradeep Natarajan; Reedik Mägi; Diane Gage; Elise B Robinson; Andres Metspalu; Veikko Salomaa; Jaana Suvisaari; Shaun M Purcell; Pamela Sklar; Sekar Kathiresan; Mark J Daly; Steven A McCarroll; Patrick F Sullivan; Aarno Palotie; Tõnu Esko; Christina Hultman; Benjamin M Neale Journal: Nat Neurosci Date: 2016-10-03 Impact factor: 24.884
Authors: Amit V Khera; Mark Chaffin; Krishna G Aragam; Mary E Haas; Carolina Roselli; Seung Hoan Choi; Pradeep Natarajan; Eric S Lander; Steven A Lubitz; Patrick T Ellinor; Sekar Kathiresan Journal: Nat Genet Date: 2018-08-13 Impact factor: 38.330
Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini Journal: Nature Date: 2018-10-10 Impact factor: 49.962