SeqArray-a storage-efficient high-performance data format for WGS variant calls.

Xiuwen Zheng, Stephanie M Gogarten, Michael Lawrence, Adrienne Stilp, Matthew P Conomos, Bruce S Weir, Cathy Laurie, David Levine

Abstract

MOTIVATION: Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format, implemented in the R/Bioconductor package 'SeqArray', that stores variant calls in an array-oriented manner. It provides the same capabilities as VCF, but adds multiple high-compression options and data access using high-performance parallel computing.
RESULTS: Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray). Reading genotypes with the SeqArray package is two to three times faster than with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users with a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.
AVAILABILITY AND IMPLEMENTATION: http://www.bioconductor.org/packages/SeqArray.
CONTACT: zhengx@u.washington.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
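
To make the allele frequency calculation mentioned under RESULTS concrete, the following is a minimal sketch in Python. It does not use SeqArray (an R/Bioconductor package) or reproduce its file format; the toy genotype values, the nested-list layout, and the helper name `ref_allele_freq` are all assumptions made for illustration. It only shows how a reference allele frequency might be computed when genotypes are held as an array indexed by variant, sample and haplotype.

```python
from typing import List, Optional

# Hypothetical toy data, not taken from the paper or from 1000 Genomes.
# Array-oriented layout: genotypes[variant][sample][haplotype] holds an
# allele index (0 = reference, 1+ = alternate, None = missing call).
genotypes: List[List[List[Optional[int]]]] = [
    [[0, 0], [0, 1], [1, 1]],        # variant 1
    [[0, 0], [0, 0], [None, None]],  # variant 2, one sample missing
    [[1, 2], [0, 2], [0, 0]],        # variant 3, multi-allelic
]


def ref_allele_freq(variant: List[List[Optional[int]]]) -> Optional[float]:
    """Reference allele frequency for one variant, ignoring missing calls."""
    alleles = [a for sample in variant for a in sample if a is not None]
    if not alleles:
        return None  # no non-missing calls at this site
    return sum(1 for a in alleles if a == 0) / len(alleles)


# One frequency per variant, computed independently of the others.
freqs = [ref_allele_freq(v) for v in genotypes]
print(freqs)  # [0.5, 1.0, 0.5]
```

Because each variant is processed independently in this sketch, the per-variant loop is the kind of work that could in principle be split across parallel workers.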

Year:  2017        PMID: 28334390      PMCID: PMC5860110          DOI: 10.1093/bioinformatics/btx145

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


