
Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration.

Patrick Deelen, Marc Jan Bonder, K Joeri van der Velde, Harm-Jan Westra, Erwin Winder, Dennis Hendriksen, Lude Franke, Morris A Swertz.

Abstract

BACKGROUND: To gain statistical power or to allow fine mapping, researchers typically want to pool data before meta-analyses or genotype imputation. However, the necessary harmonization of genetic datasets is currently error-prone because of many different file formats and lack of clarity about which genomic strand is used as reference.
FINDINGS: Genotype Harmonizer (GH) is a command-line tool to harmonize genetic datasets by automatically solving issues concerning genomic strand and file format. GH solves the unknown strand issue by aligning ambiguous A/T and G/C SNPs to a specified reference, using linkage disequilibrium patterns without prior knowledge of the used strands. GH supports many common GWAS/NGS genotype formats including PLINK, binary PLINK, VCF, SHAPEIT2 & Oxford GEN. GH is implemented in Java and a large part of the functionality can also be used as Java 'Genotype-IO' API. All software is open source under license LGPLv3 and available from http://www.molgenis.org/systemsgenetics.
CONCLUSIONS: GH can be used to harmonize genetic datasets across different file formats and can be easily integrated as a step in routine meta-analysis and imputation pipelines.

Year: 2014 PMID: 25495213 PMCID: PMC4307387 DOI: 10.1186/1756-0500-7-901

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Background

Genome-wide association studies (GWAS) increasingly require the integration of multiple genetic datasets to reach sufficient resolution and statistical power, either by imputing missing genotypes or by pooling datasets for a meta-analysis. However, two major challenges must be resolved: 1) the large number of different file formats used by the genetics community, and 2) the ambiguous A/T and G/C single nucleotide polymorphisms (SNPs) for which the strand is not obvious. For many statistical analyses, such as meta-analyses of GWAS [1] and genotype imputation [2], it is vital that the datasets to be used are aligned to the same genomic strand. Genotype data can be coded on either the forward or the reverse genomic strand (e.g. a SNP coded T/G on the forward strand would be coded A/C on the reverse strand). The strand used to store the genotypes is not always the same within a dataset (i.e. the same strand may not be used for all variants) or between the different datasets to be aligned (i.e. the same strand may not be used for a variant present in both datasets); these differences can be intentional [3] or accidental. To complicate matters, most of the common file formats do not define the strand used. For some types of SNPs, it is fairly straightforward to detect and correct the strand differences. For example, a T/G SNP is non-ambiguous because its complement on the other strand is A/C. However, G/C and T/A variants are ambiguous, or cryptic, because their complementary alleles are C/G and A/T, respectively, so the same two alleles appear on both strands. This ambiguity makes it more difficult to detect and resolve strand issues for these SNPs. It is of course possible to simply exclude all ambiguous variants, but modern genotyping chips often contain many A/T and G/C SNPs: the ImmunoChip has 25,740 such SNPs (1.7% of all SNPs), the ExomeChip 244,771 (11.9%) and the Omni5-quad 144,578 (3.4%).
Simply excluding these variants will limit the power of a GWAS meta-analysis where the A/T or G/C variant is the causal variant or is in higher LD with the causal variant. In the case of imputation it has also been shown that more input genotypes yield imputed genotypes of higher quality [4], so it is preferable to include the A/T and G/C variants where possible. When the strand of the genotypes is known, there are many ways to correct the strands of one dataset or to state the strand explicitly, as is possible in IMPUTE2 [5] or METAL [6]. In practice, however, this information is not always available or trustworthy. One solution to the problem of unknown strands is to compare the minor allele between two datasets. However, use of the minor allele is not ideal as it can differ between datasets and populations, especially for common variants. PLINK [7] employs a more powerful approach to detect strand inconsistencies between cases and controls. However, this method requires many manual steps: re-coding of phenotypes before and after the actual alignment, manual alignment of the non-ambiguous SNPs, merging the data into one dataset, and finally writing a script to parse the alignment results from PLINK to determine the actual alignment. In addition, PLINK cannot align genotypes that are stored with posterior probabilities.
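The distinction between non-ambiguous and ambiguous SNPs described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of GH or its Genotype IO API; the sketch only encodes the base-complement rule stated in the text.

```python
# Watson-Crick complements: reading a SNP on the opposite strand swaps each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_alleles(alleles):
    """Return the alleles as they would be coded on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(alleles):
    """True for A/T and G/C SNPs: complementing gives back the same allele pair."""
    return set(complement_alleles(alleles)) == set(alleles)


print(complement_alleles(("T", "G")))   # ('A', 'C')
print(is_strand_ambiguous(("T", "G")))  # False
print(is_strand_ambiguous(("G", "C")))  # True
```

For a T/G SNP the complemented pair A/C differs from the original, so a strand flip is visible from the alleles alone. For A/T and G/C SNPs the pair is unchanged, which is why additional information such as LD is needed.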

Implementation

Here, we present Genotype Harmonizer (GH): a new command-line tool to automate genotype data harmonization. GH can read commonly used file formats (PLINK, binary PLINK, VCF, SHAPEIT2 & Oxford GEN) and align a study dataset to a specified reference without any prior knowledge of the strand used. After alignment, GH writes data back to a chosen format (PLINK, binary PLINK, SHAPEIT2 or Oxford GEN). All handling of the genotype data and loading genotypes from the different formats is implemented in our Genotype IO library, which also allows integration of the harmonization tools into other software. GH consists of 25,000 lines of code with a high unit test coverage of over 60% at conditional level and continuous build testing. GH is written in Java and has been tested under Linux, Windows, and OS X. All source code is available at http://www.github.com/molgenis/systemsgenetics. GH implements a fully automated method that assigns the strand of ambiguous SNPs by selecting nearby non-ambiguous SNPs that are in linkage disequilibrium (LD) in both the study data and the reference data. GH correlates the estimated haplotype frequencies between the study data and the reference data. If GH finds more negative correlations than positive ones in haplotype frequencies, the ambiguous SNP is swapped to the other strand. When GH is unable to align a SNP (e.g. because of a lack of surrounding SNPs), this ambiguous SNP is excluded from the set. To prevent the exclusion of variants that could not be aligned using LD, GH can optionally perform alignment using the minor allele for variants that have a minor allele frequency below a specified value.
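The voting idea behind the LD-based alignment can be sketched in Python as follows. This is a simplified illustration and not GH's implementation: it assumes each SNP is given as a list of per-sample allele dosages (0, 1 or 2), it compares the sign of a plain Pearson correlation between the ambiguous SNP and each neighbouring non-ambiguous SNP in the study and in the reference, whereas GH correlates estimated haplotype frequencies. The function names and the `min_abs_r` threshold are hypothetical; the minimum of 3 informative variants follows the criterion mentioned in the comparison with PLINK.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length dosage vectors (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def decide_strand(study_target, study_neighbours, ref_target, ref_neighbours,
                  min_abs_r=0.3, min_informative=3):
    """Return 'keep', 'swap' or 'exclude' for one ambiguous SNP.

    Neighbours are non-ambiguous SNPs that are already strand-aligned and
    listed in the same order in the study and the reference.
    """
    agree = disagree = 0
    for s_nb, r_nb in zip(study_neighbours, ref_neighbours):
        r_study = pearson(study_target, s_nb)
        r_ref = pearson(ref_target, r_nb)
        # Only neighbours in LD with the target in BOTH datasets get a vote.
        if abs(r_study) < min_abs_r or abs(r_ref) < min_abs_r:
            continue
        if r_study * r_ref > 0:
            agree += 1      # same LD direction: strands are consistent
        else:
            disagree += 1   # opposite LD direction: evidence for a strand flip
    if agree + disagree < min_informative:
        return "exclude"    # too few variants in LD to decide
    return "swap" if disagree > agree else "keep"


# Toy data: the reference, and a study in which the ambiguous SNP is flipped.
ref_target = [0, 1, 2, 0, 1, 2]
ref_nbs = [[0, 1, 2, 0, 1, 1], [0, 1, 2, 1, 1, 2], [0, 0, 2, 0, 1, 2]]
study_true = [2, 1, 0, 2, 0, 1]
study_flipped = [2 - d for d in study_true]
study_nbs = [[2, 1, 0, 2, 0, 1], [2, 1, 0, 1, 0, 1], [2, 1, 0, 2, 0, 0]]

print(decide_strand(study_flipped, study_nbs, ref_target, ref_nbs))  # swap
print(decide_strand(study_true, study_nbs, ref_target, ref_nbs))     # keep
```

Flipping the strand of an ambiguous SNP inverts which allele is counted, so its correlation with every neighbour changes sign in the study but not in the reference. A majority of sign disagreements therefore indicates a flip.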

Findings

Usage in an imputation workflow

We advise applying GH to pre-phased data before imputation. When pre-phasing using SHAPEIT2 [8] and imputing using IMPUTE2, GH can read the SHAPEIT2 output directly and can write aligned results in the same format for direct use by IMPUTE2 (Figure 1). Performing the alignment after the pre-phasing step ensures that pre-phasing does not need to be repeated when imputing using a different reference set or a newer version of a reference set. GH can also update the variant identifiers of the study data to match the reference set identifiers using the --update-id option. An example command is:

Figure 1

Usage of Genotype Harmonizer. A) GH can be applied after the pre-phasing of the genotypes, preventing the need to redo the phasing for each new version of a haplotype reference set. B) GH can be used to align and reformat genotype datasets allowing easy merging or meta-analysing of data. By aligning all datasets to a public reference, the genotype data can be kept private by consortia members.

Usage to harmonize GWAS data

GH can also be used in merging or meta-analysis of different GWAS datasets (Figure 1). One of the datasets can be used as a reference and the other datasets can be aligned to it, or all the cohorts can be aligned to a public reference set. It is possible to include all the variants present in the study data that are not in the reference set using the --keep option. After alignment the datasets can be investigated using a meta-analysis or can be merged into a single dataset. An example command is:

GenotypeHarmonizer.sh --input dataset1 --ref dataset2 --output dataset1Aligned --update-id --keep
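When several cohorts are aligned to one public reference, the same command is repeated per cohort. The following Python sketch builds those command lines using only the options shown in the example command above. The helper name, the cohort names and the reference name are hypothetical placeholders, and the sketch only prints the commands rather than running them.

```python
def harmonizer_command(study, reference, output, update_id=True, keep=True):
    """Build a GenotypeHarmonizer.sh argument list from the options in the text."""
    cmd = ["GenotypeHarmonizer.sh",
           "--input", study,
           "--ref", reference,
           "--output", output]
    if update_id:
        cmd.append("--update-id")  # match study variant IDs to the reference
    if keep:
        cmd.append("--keep")       # keep study variants absent from the reference
    return cmd


# Hypothetical cohort and reference names, for illustration only.
for cohort in ["cohortA", "cohortB"]:
    print(" ".join(harmonizer_command(cohort, "publicReference", cohort + "Aligned")))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` on a system where GenotypeHarmonizer.sh is installed and on the PATH.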

Performance

GH requires 6:35 minutes to align a GWAS dataset consisting of 168,408 SNPs and 25,169 samples in binary PLINK format to another GWAS dataset with 528,969 SNPs and 11,950 samples, using a Linux system, a single core and 4 GB of RAM. Aligning the SHAPEIT2 results (25,169 and 19,321 variants on chromosome 1) to the Genome of The Netherlands imputation reference (499 samples, 1,536,126 SNPs on chromosome 1) [9] took 36 seconds using a single core and <1 GB of RAM.

Comparison using PLINK alignment

We compared the alignment of ambiguous variants using GH to the alignment using the flip-scan option in PLINK, using the latest HapMap3 data. We randomly assigned the samples to two equally sized sets, henceforth denoted set1 and set2. In set1 we randomly changed the strand of roughly 50% of the A/T and G/C variants. Set1 was then aligned with GH, using set2 as the reference and the default settings. We successfully aligned 40,617 of the 55,517 swapped variants; 14 variants (0.03%) were aligned to the incorrect strand. In total 29,801 A/T and G/C variants (27% of the total ambiguous variants) were excluded because there were not enough variants in LD for accurate alignment. GH did not swap any variants that had not been flipped in our test set. For the analysis using PLINK we denoted the samples in set1 as cases and those in set2 as controls; we merged both sets and used the flip-scan option with the default settings. PLINK does not actually report which variants should be swapped but instead provides a log with information on which the decision to swap a variant can be based. Since the PLINK manual does not provide a recommendation on how to select the variants to swap based on this file, we used the same criteria as those used by GH, i.e. there need to be at least 3 variants in LD, and we then assessed whether there were more positive than negative correlations. This resulted in the successful alignment of 37,402 SNPs and the incorrect alignment of 54 SNPs (0.14%); 36,390 variants (33% of the total ambiguous variants) were excluded because of a lack of variants in LD. We thus find that the number of incorrectly aligned SNPs increased by 40 and the number of excluded SNPs increased by 22%, from 29,801 to 36,390, when using PLINK instead of GH.
Moreover, in one command GH covers many separate steps which require considerable manual work or scripting when using PLINK: manual alignment of non-ambiguous variants (which PLINK cannot do automatically), conversion of reference haplotypes to a PLINK supported format, merging the reference and study datasets, recoding using a fake phenotype file, running PLINK flip-scan to find swapped SNPs, and the selection and swapping of the SNPs on the wrong strand.
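The summary figures in the comparison above can be re-derived from the reported counts with a short Python calculation. The text does not state which denominator was used for the error percentages; this sketch assumes the number of successfully aligned variants, which reproduces the reported 0.03% and 0.14% after rounding.

```python
# Counts reported in the comparison of GH and PLINK flip-scan.
gh = {"aligned": 40_617, "incorrect": 14, "excluded": 29_801}
plink = {"aligned": 37_402, "incorrect": 54, "excluded": 36_390}


def pct(part, whole):
    """Percentage of `part` relative to `whole`."""
    return 100.0 * part / whole


# Additional incorrectly aligned SNPs when using PLINK instead of GH.
extra_incorrect = plink["incorrect"] - gh["incorrect"]

# Relative increase in excluded variants when using PLINK instead of GH.
exclusion_increase = pct(plink["excluded"] - gh["excluded"], gh["excluded"])

# Error rates, ASSUMING the denominator is the successfully aligned count.
gh_error = pct(gh["incorrect"], gh["aligned"])
plink_error = pct(plink["incorrect"], plink["aligned"])

print(f"extra incorrect SNPs with PLINK: {extra_incorrect}")
print(f"increase in excluded variants: {exclusion_increase:.0f}%")
print(f"GH error rate: {gh_error:.2f}%, PLINK error rate: {plink_error:.2f}%")
```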

Conclusions

We have shown that Genotype Harmonizer provides near perfect alignment of ambiguous SNPs without any prior knowledge of the strands. Compared to PLINK we have improved the strand alignment and limited the number of manual steps without sacrificing run-time performance. Another advantage of GH over PLINK is our support of file formats storing haplotype phase or genotype probability information, which also makes our software useful within an imputation workflow or on data that has already been imputed. GH uses an advanced LD-based method to perform the alignment of ambiguous SNPs and supports many genotype file formats. The underlying Genotype IO API is part of the MOLGENIS open source suite [10], which is also used by several other genetic analysis tools, and we expect the number of supported formats to grow in the future. These enhancements will be made available in later releases of GH. We have used GH to harmonize over 15 imputation and GWAS datasets [11-14]. GH is now a standard part of our imputation workflow and has been applied to over 25,000 samples (publications in preparation). We expect GH to be a major time saver for many research groups and to become a standard part of many analysis pipelines, as it alleviates manual steps when imputing data or when working with multiple GWAS datasets.

Availability and requirements

Project name: Genotype Harmonizer
Project home page: http://www.molgenis.org/systemsgenetics
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.6 or higher
License: LGPLv3
Any restrictions to use by non-academics: Free to use

12 in total

Review 1. Genotype imputation for genome-wide association studies.

Authors: Jonathan Marchini; Bryan Howie
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

2. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

Review 3. Meta-analysis methods for genome-wide association studies and beyond.

Authors: Evangelos Evangelou; John P A Ioannidis
Journal: Nat Rev Genet Date: 2013-05-09 Impact factor: 53.242

4. Whole-genome sequence variation, population structure and demographic history of the Dutch population.

Authors:
Journal: Nat Genet Date: 2014-06-29 Impact factor: 38.330

5. The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button.

Authors: Morris A Swertz; Martijn Dijkstra; Tomasz Adamusiak; Joeri K van der Velde; Alexandros Kanterakis; Erik T Roos; Joris Lops; Gudmundur A Thorisson; Danny Arends; George Byelas; Juha Muilu; Anthony J Brookes; Engbert O de Brock; Ritsert C Jansen; Helen Parkinson
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

6. METAL: fast and efficient meta-analysis of genomewide association scans.

Authors: Cristen J Willer; Yun Li; Gonçalo R Abecasis
Journal: Bioinformatics Date: 2010-07-08 Impact factor: 6.937

7. Fine mapping of the celiac disease-associated LPP locus reveals a potential functional variant.

Authors: Rodrigo Almeida; Isis Ricaño-Ponce; Vinod Kumar; Patrick Deelen; Agata Szperl; Gosia Trynka; Javier Gutierrez-Achury; Alexandros Kanterakis; Harm-Jan Westra; Lude Franke; Morris A Swertz; Mathieu Platteel; Jose Ramon Bilbao; Donatella Barisani; Luigi Greco; Luisa Mearin; Victorien M Wolters; Chris Mulder; Maria Cristina Mazzilli; Ajit Sood; Bozena Cukrowska; Concepción Núñez; Riccardo Pratesi; Sebo Withoff; Cisca Wijmenga
Journal: Hum Mol Genet Date: 2013-12-11 Impact factor: 6.150

8. Impact of pre-imputation SNP-filtering on genotype imputation results.

Authors: Nab Raj Roshyara; Holger Kirsten; Katrin Horn; Peter Ahnert; Markus Scholz
Journal: BMC Genet Date: 2014-08-12 Impact factor: 2.797

9. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Authors: Bryan N Howie; Peter Donnelly; Jonathan Marchini
Journal: PLoS Genet Date: 2009-06-19 Impact factor: 5.917

10. Genetic and epigenetic regulation of gene expression in fetal and adult human livers.

Authors: Marc Jan Bonder; Silva Kasela; Mart Kals; Riin Tamm; Kaie Lokk; Isabel Barragan; Wim A Buurman; Patrick Deelen; Jan-Willem Greve; Maxim Ivanov; Sander S Rensen; Jana V van Vliet-Ostaptchouk; Marcel G Wolfs; Jingyuan Fu; Marten H Hofker; Cisca Wijmenga; Alexandra Zhernakova; Magnus Ingelman-Sundberg; Lude Franke; Lili Milani
Journal: BMC Genomics Date: 2014-10-04 Impact factor: 3.969
