MOTIVATION: Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format, implemented in the R/Bioconductor package 'SeqArray', that stores variant calls in an array-oriented manner. It provides the same capabilities as VCF, but with multiple high-compression options and data access using high-performance parallel computing.

RESULTS: Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray). Reading genotypes with the SeqArray package is two to three times faster than with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with other R/Bioconductor packages, the SeqArray package provides users with a flexible, feature-rich, high-performance programming environment for the analysis of WGS variant data.

AVAILABILITY AND IMPLEMENTATION: http://www.bioconductor.org/packages/SeqArray.

CONTACT: zhengx@u.washington.edu.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
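
To make the benchmarked allele frequency calculation concrete, the following is a minimal Python sketch of the computation over an array-oriented genotype layout. It is not the SeqArray implementation, which is an R/Bioconductor package. The nested-list layout, the coding of alleles (0 for reference, positive integers for alternates, `None` for a missing call), the function name `ref_allele_freq` and the toy data are all assumptions made for illustration.

```python
from typing import List, Optional

# Hypothetical toy layout (an assumption, not SeqArray's on-disk format):
# genotypes[variant][sample][allele_copy]
# 0 = reference allele, 1+ = alternate allele, None = missing call.
Genotypes = List[List[List[Optional[int]]]]


def ref_allele_freq(genotypes: Genotypes) -> List[Optional[float]]:
    """Return the reference allele frequency for each variant.

    Missing calls are excluded from both numerator and denominator.
    A variant with no called alleles yields None.
    """
    freqs: List[Optional[float]] = []
    for variant in genotypes:
        n_called = 0  # non-missing allele copies at this variant
        n_ref = 0     # allele copies equal to the reference allele
        for sample in variant:
            for allele in sample:
                if allele is None:
                    continue
                n_called += 1
                if allele == 0:
                    n_ref += 1
        freqs.append(n_ref / n_called if n_called else None)
    return freqs


# Three variants, three diploid samples (made-up data).
toy = [
    [[0, 0], [0, 1], [1, 1]],                    # 3 reference copies of 6 called
    [[0, 0], [None, None], [0, 1]],              # 3 reference copies of 4 called
    [[None, None], [None, None], [None, None]],  # nothing called
]

print(ref_allele_freq(toy))
```

Because each variant's genotypes sit together in the array, the per-variant loop touches only the data it needs, and variants can be processed independently, which is the property that makes this kind of calculation straightforward to parallelize.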