Sparse Project VCF: efficient encoding of population genotype matrices.

Michael F Lin, Xiaodong Bai, William J Salerno, Jeffrey G Reid.

Abstract

SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts.
AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press.

Year:  2021        PMID: 33300997      PMCID: PMC8016461          DOI: 10.1093/bioinformatics/btaa1004

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Variant Call Format (VCF) is the prevailing representation for small germline variants discovered by high-throughput sequencing (Danecek et al., 2011). In addition to capturing variants sequenced in one study participant, VCF can represent the genotypes for many participants at all discovered variant loci. This ‘Project VCF’ (pVCF) form is a 2-D matrix with loci down the rows and participants across the columns, filled in with each called genotype and annotations thereof, including quality-control (QC) measures like read depth, strand ratio and genotype likelihoods. As the number of study participants N grows (columns), more variant loci are also discovered (rows), leading to super-linear growth of the pVCF genotype matrix. Moreover, because cohort sequencing discovers mostly rare variants, this matrix consists largely of reference-identical genotypes and their high-entropy QC measures. In recent experiments with human whole-exome sequencing (WES), doubling N from 25 000 to 50 000 also increased the pVCF locus count by 43%, and 96% of all loci had non-reference allele frequency below 0.1% (Lin ). Empirically, vcf.gz file sizes in WES and whole-genome sequencing (WGS) are growing roughly with in the largest studies as of this writing ( WES). Unchecked, we project WGS will yield petabytes of compressed pVCF.
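As a rough illustration of the super-linear growth described above, the following sketch uses only the two figures quoted in the text (N doubling, locus count rising by 43%) to estimate how the number of matrix cells scales with N. The power-law form and the function name are assumptions made for illustration, and cell count is only a proxy for compressed file size.

```python
import math


def implied_scaling_exponent(n_ratio: float, locus_ratio: float) -> float:
    """Exponent k such that cells ~ N**k, assuming a simple power law (an assumption)."""
    # The pVCF matrix has one cell per (locus, participant) pair.
    cell_ratio = n_ratio * locus_ratio
    return math.log(cell_ratio) / math.log(n_ratio)


# Figures quoted in the text: N doubles (25 000 -> 50 000), loci increase by 43%.
cell_ratio = 2.0 * 1.43
exponent = implied_scaling_exponent(2.0, 1.43)
print(f"cells grow {cell_ratio:.2f}x, implied exponent {exponent:.2f}")
```

Under these assumptions, doubling the cohort nearly triples the number of genotype cells, which is the growth that the approach below targets.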

2 Approach

We sought an incremental solution to these challenges for existing pVCF-based pipelines, whose maintainers may be reluctant to adopt fundamentally different formats or data models (Danek and Deorowicz, 2018; Deorowicz and Danek, 2019; Lan et al., 2020; Layer et al., 2015; Li, 2016; Zheng et al., 2017; Supplementary Appendix S1), so as to minimize disruption to existing processes and users. To this end, we developed Sparse Project VCF (spVCF), which adds three simple features to VCF (Fig. 1):
Fig. 1. spVCF encoding example. (A) Illustrative pVCF of four variant loci in three sequenced study participants, with matrix entries encoding called genotypes and several numeric QC measures. Some required VCF fields are omitted for brevity. (B) spVCF encoding of the same example. QC values for reference-identical and non-called cells are reduced to a power-of-two lower bound on read depth DP. Runs of identical entries down columns are abbreviated using quotation marks, then runs of these marks across rows are length-encoded. Cy’s entries are shown column-aligned for clarity; the encoded text matrix is ragged.

Squeezing: judiciously reducing QC entropy. In those cells with zero reads supporting a variant (typically Allele Depth for any d) and corresponding non-variant genotype, we discard all fields except the genotype GT and the read depth DP, which we also round down to a power of two (0, 1, 2, 4, 8, 16,…; configurable). Any cell reporting evidence of variation retains its original QC measures and other annotations. This convention, inspired by common base quality score compression techniques, aims to preserve nearly all useful information, removing minor fluctuations in non-variant cells. (If required for compatibility, non-variant genotype likelihoods could be approximated from depth, albeit without read quality inputs that might subtly affect downstream calculations.)

Succinct, lossless encoding for runs of reference-identical cells. First, we replace the contents of a reference-identical (or non-called) cell with a double-quotation mark if it is identical to the cell above it, compressing runs down the column for each sample. Then we run-length encode these quotation marks across the rows, so for example a stretch of 42 marks across a row is written "42 instead of repeating the mark 42 times. The second, horizontal run-encoding step has negligible effect on zipped size, but should enable faster downstream processing, e.g. sample subset extraction.
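The squeezing and run-encoding steps described above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes every cell uses the hypothetical field order GT:DP:AD (so trailing fields can simply be dropped), that a single repeated cell is written as a bare quotation mark, and that the toy genotype values are made up.

```python
def round_down_pow2(dp: int) -> int:
    """Largest power of two <= dp; 0 stays 0."""
    return 0 if dp <= 0 else 1 << (dp.bit_length() - 1)


def is_ref_or_nocall(cell: str) -> bool:
    """True if every GT allele is reference (0) or missing (.)."""
    gt = cell.split(":")[0].replace("|", "/")
    return all(a in ("0", ".") for a in gt.split("/"))


def squeeze_cell(cell: str, fmt=("GT", "DP", "AD")) -> str:
    """Keep only GT and power-of-two DP when no reads support a variant."""
    fields = dict(zip(fmt, cell.split(":")))
    alt_reads = [x for x in fields.get("AD", ".").split(",")[1:] if x.isdigit()]
    if is_ref_or_nocall(cell) and all(int(x) == 0 for x in alt_reads):
        dp = fields.get("DP", ".")
        dp_q = str(round_down_pow2(int(dp))) if dp.isdigit() else "."
        return f"{fields['GT']}:{dp_q}"
    return cell  # any evidence of variation: keep all original QC fields


def encode_rows(rows):
    """Quote cells identical to the one above, then length-encode the quotes."""
    encoded, prev = [], None
    for row in rows:
        out, run = [], 0
        for j, cell in enumerate(row):
            if prev is not None and is_ref_or_nocall(cell) and cell == prev[j]:
                run += 1
                continue
            if run:
                out.append('"' if run == 1 else f'"{run}')
                run = 0
            out.append(cell)
        if run:
            out.append('"' if run == 1 else f'"{run}')
        encoded.append(out)
        prev = row  # comparison is always against the un-encoded row above
    return encoded


# Toy pVCF body: three loci (rows) by three samples (columns), GT:DP:AD.
raw = [
    ["0/0:32:32,0", "0/0:17:17,0", "0/1:20:12,8"],
    ["0/0:35:35,0", "0/0:18:18,0", "0/0:9:9,0"],
    ["0/0:33:33,0", "0/1:21:11,10", "0/0:8:8,0"],
]
squeezed = [[squeeze_cell(c) for c in row] for row in raw]
encoded = encode_rows(squeezed)
for tokens in encoded:
    print("\t".join(tokens))
```

In this toy input the depth values 32, 35 and 33 all round down to 32, so cells that differed only by small fluctuations become identical and collapse into quotation marks.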
The QC squeezing synergizes with the run-encoding, by converting minor fluctuations into identical runs down each column.

Checkpointing to facilitate random access by genome range (row). While all variant genotype cells are readily accessible from a given spVCF row, fully decoding the remaining cells could require information from an arbitrary number of prior rows. Instead, the spVCF encoder periodically skips run-encoding a row, emitting a row identical to the squeezed pVCF. Each run-encoded row indicates the position of the last such checkpoint row, from which decoding can commence.

Our Apache-licensed Unix tool spvcf provides subcommands to (i) squeeze and run-encode pVCF to spVCF, (ii) squeeze pVCF without run-encoding (producing valid pVCF usually much smaller, albeit not as small as spVCF) or (iii) decode spVCF back to pVCF. If a spVCF file is compressed using bgzip, then tabix can create an index for it (Li, 2011) based on the unchanged locus-level VCF fields. A spvcf subcommand, used in place of tabix for retrieval, can then access the file by genome position, generating a standalone spVCF slice.
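The checkpointing scheme described above can be sketched as a decoder that starts from the nearest checkpoint row instead of the top of the file. The in-memory layout used here, a list of (checkpoint_index, tokens) pairs, is a hypothetical stand-in for the file format, and the toy rows are made up; a single quotation mark is assumed to mean a run of length one.

```python
def expand_tokens(tokens, above):
    """Expand one run-encoded row, given the fully decoded row above it."""
    row = []
    for tok in tokens:
        if tok.startswith('"'):
            run = int(tok[1:]) if len(tok) > 1 else 1
            start = len(row)
            row.extend(above[start:start + run])  # copy cells from the row above
        else:
            row.append(tok)
    return row


def decode_row(encoded, i):
    """Decode row i by replaying rows from its checkpoint onward.

    `encoded` is a hypothetical list of (checkpoint_index, tokens) pairs.
    A checkpoint row points at itself and contains no quotation marks.
    """
    ckpt, _ = encoded[i]
    row = list(encoded[ckpt][1])  # checkpoint rows are stored fully dense
    for k in range(ckpt + 1, i + 1):
        row = expand_tokens(encoded[k][1], row)
    return row


# Toy spVCF body: rows 0 and 3 are checkpoints, the rest are run-encoded.
encoded = [
    (0, ["0/0:32", "0/0:16", "0/1:20:12,8"]),
    (0, ['"2', "0/0:8"]),
    (0, ['"', "0/1:21:11,10", '"']),
    (3, ["0/0:32", "0/0:16", "0/0:8"]),
    (3, ['"3']),
]
print(decode_row(encoded, 2))
print(decode_row(encoded, 4))
```

Decoding row 4 touches only rows 3 and 4, which is why a genome-range query needs to read back no further than the last checkpoint before the range.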

3 spVCF for DiscovEHR and UK Biobank

We tested spVCF on two large WES studies based on different upstream variant-calling pipelines.

First, using WES from the DiscovEHR study (Dewey et al., 2016), we reduced a GATK-based pVCF file with chromosome 2 variant loci from a 79 GiB vcf.gz file to a 5.2 GiB spvcf.gz file, a 15× size reduction. Most of this reduction (6.9×) was achieved by the QC squeezing, while the run-encoding contributed 2.2×. Experiments with nested subsets of these WES indicate spvcf.gz file sizes growing roughly with , compared to the original’s (Supplementary Fig. S1). VCF’s binary equivalent, BCF, reduces this example by 1.2× losslessly and exhibits the same scaling.

Second, with WES from UK Biobank (Van Hout et al., 2020), spVCF reduced vcf.gz files for loci in ten representative chromosome 2 segments from 110 to 7.7 GiB (Supplementary Table S1). This 14× combined ratio is similar to that achieved for DiscovEHR; decomposed, however, QC squeezing was relatively less impactful (4.2×) and run-encoding relatively more so (3.4×). On the one hand, the UK Biobank pVCF files were produced using a different upstream pipeline (‘SPB’) that already omitted genotype likelihoods for most reference-identical cells, leaving less to be squeezed out compared to DiscovEHR. On the other hand, the run-encoding’s effectiveness improved along with the 3.3×-higher variant locus density in the larger cohort, a trend expected to continue with larger N.

In single-threaded tests (Supplementary Appendix S2), spvcf encoded raw pVCF slightly faster than bgzip compressed it (both tools also have multithreaded modes). The decoder, with inputs and outputs both much smaller than the original pVCF, is several times faster. This makes it feasible to store spVCF files and decode them to pVCF only for transient use when needed.
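The reported ratios can be cross-checked with a few lines of arithmetic. The sketch below assumes the squeezing and run-encoding factors combine multiplicatively, which the text implies but does not state outright; the dictionary keys and function name are illustrative.

```python
def ratios(before_gib: float, after_gib: float, squeeze: float, run_encode: float):
    """Return (overall size ratio, product of the two per-step ratios)."""
    return before_gib / after_gib, squeeze * run_encode


# File sizes and per-step factors as quoted in the text.
studies = {
    "DiscovEHR": (79.0, 5.2, 6.9, 2.2),
    "UK Biobank": (110.0, 7.7, 4.2, 3.4),
}
for name, figures in studies.items():
    overall, product = ratios(*figures)
    print(f"{name}: overall {overall:.1f}x, squeeze x run-encode = {product:.1f}x")
```

For both cohorts the product of the two step-wise factors agrees with the overall size ratio to within rounding.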

4 Discussion

spVCF is a practical ‘next step’ for storage and transfer in ongoing cohort sequencing projects, delivering far-reduced size growth and performant interoperability with existing pipelines. Upstream, joint-calling tools can stream their output pVCF into spvcf for now, and perhaps eventually generate spVCF natively. Downstream, population analysis tools can stream decoded pVCF from spvcf, with the future possibility of consuming spVCF directly. spVCF clears a path to scale up the VCF data model to WGS studies, notwithstanding residual super-linear size growth likely due to multiallelic loci and depth fluctuations. Meanwhile, many investigators, keeping pace with new sequencing technologies, are developing haplotype-centric paradigms that might eventually replace VCF.

Financial Support: XB, WJS, and JGR are employees of Regeneron Pharmaceuticals, Inc.

Conflict of Interest: none declared.


1.  Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2011-01-05       Impact factor: 6.937

2.  BGT: efficient and flexible genotype query across many samples.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2015-10-24       Impact factor: 6.937

3.  GTShark: genotype compression in large projects.

Authors:  Sebastian Deorowicz; Agnieszka Danek
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

4.  Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study.

Authors:  Frederick E Dewey; Michael F Murray; John D Overton; Lukas Habegger; Joseph B Leader; Samantha N Fetterolf; Colm O'Dushlaine; Cristopher V Van Hout; Jeffrey Staples; Claudia Gonzaga-Jauregui; Raghu Metpally; Sarah A Pendergrass; Monica A Giovanni; H Lester Kirchner; Suganthi Balasubramanian; Noura S Abul-Husn; Dustin N Hartzel; Daniel R Lavage; Korey A Kost; Jonathan S Packer; Alexander E Lopez; John Penn; Semanti Mukherjee; Nehal Gosalia; Manoj Kanagaraj; Alexander H Li; Lyndon J Mitnaul; Lance J Adams; Thomas N Person; Kavita Praveen; Anthony Marcketta; Matthew S Lebo; Christina A Austin-Tse; Heather M Mason-Suares; Shannon Bruse; Scott Mellis; Robert Phillips; Neil Stahl; Andrew Murphy; Aris Economides; Kimberly A Skelding; Christopher D Still; James R Elmore; Ingrid B Borecki; George D Yancopoulos; F Daniel Davis; William A Faucett; Omri Gottesman; Marylyn D Ritchie; Alan R Shuldiner; Jeffrey G Reid; David H Ledbetter; Aris Baras; David J Carey
Journal:  Science       Date:  2016-12-23       Impact factor: 47.728

5.  GTC: how to maintain huge genotype collections in a compressed form.

Authors:  Agnieszka Danek; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2018-06-01       Impact factor: 6.937

6.  Efficient genotype compression and analysis of large genetic-variation data sets.

Authors:  Ryan M Layer; Neil Kindlon; Konrad J Karczewski; Aaron R Quinlan
Journal:  Nat Methods       Date:  2015-11-09       Impact factor: 28.547

7.  SeqArray-a storage-efficient high-performance data format for WGS variant calls.

Authors:  Xiuwen Zheng; Stephanie M Gogarten; Michael Lawrence; Adrienne Stilp; Matthew P Conomos; Bruce S Weir; Cathy Laurie; David Levine
Journal:  Bioinformatics       Date:  2017-08-01       Impact factor: 6.937

8.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

9.  genozip: a fast and efficient compression tool for VCF files.

Authors:  Divon Lan; Raymond Tobler; Yassine Souilmi; Bastien Llamas
Journal:  Bioinformatics       Date:  2020-07-01       Impact factor: 6.937

10.  Exome sequencing and characterization of 49,960 individuals in the UK Biobank.

Authors:  Cristopher V Van Hout; Ioanna Tachmazidou; Joshua D Backman; Joshua D Hoffman; Daren Liu; Ashutosh K Pandey; Claudia Gonzaga-Jauregui; Shareef Khalid; Bin Ye; Nilanjana Banerjee; Alexander H Li; Colm O'Dushlaine; Anthony Marcketta; Jeffrey Staples; Claudia Schurmann; Alicia Hawes; Evan Maxwell; Leland Barnard; Alexander Lopez; John Penn; Lukas Habegger; Andrew L Blumenfeld; Xiaodong Bai; Sean O'Keeffe; Ashish Yadav; Kavita Praveen; Marcus Jones; William J Salerno; Wendy K Chung; Ida Surakka; Cristen J Willer; Kristian Hveem; Joseph B Leader; David J Carey; David H Ledbetter; Lon Cardon; George D Yancopoulos; Aris Economides; Giovanni Coppola; Alan R Shuldiner; Suganthi Balasubramanian; Michael Cantor; Matthew R Nelson; John Whittaker; Jeffrey G Reid; Jonathan Marchini; John D Overton; Robert A Scott; Gonçalo R Abecasis; Laura Yerges-Armstrong; Aris Baras
Journal:  Nature       Date:  2020-10-21       Impact factor: 69.504
