Erik Garrison, Zev N Kronenberg, Eric T Dawson, Brent S Pedersen, Pjotr Prins.
Abstract
Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. VCF can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries, which we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformation of variant files. They run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how pangenome graph formats can address more complex variation that cannot easily be represented by VCF.
Year: 2022 PMID: 35639788 PMCID: PMC9286226 DOI: 10.1371/journal.pcbi.1009123
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
A selection of VCF processing tools included with vcflib (a full list of over 125 tools with full descriptions and options can be found in the online vcflib documentation [17]).
| Name | Description |
|---|---|
|  |  |
|  | add info fields from a second VCF file for records missing in the first file. |
|  | insert AC and NS fields using sample genotypes |
|  | filter on common fields, e.g. |
|  | filter out duplicate entries |
|  | filter unique alleles only |
|  |  |
| vcfintersect | set operations: intersect, union, complement |
| vcfstreamsort | sort |
|  | merge files by overlaying |
| vcfcombine | combine samples on identical sites |
| vcffixup | update fields |
| vcfannotate | annotate records from BED file |
|  | flatten multi-allelic sites with common |
| vcfgeno2haplo | transform phased alleles into haplotypes |
|  | compare VCF files and add annotations to |
|  | design primers |
| vcf2tsv | convert to tab separated table |
|  | convert to FASTA |
| vcf2bed | convert to BED |
| vcf2sqlite | convert to SQLite |
|  | averages a set of scores over a sliding genomic window |
|  |  |
|  | compute distance between positions and add field |
|  | annotates and add field for sequence entropy for a window |
|  |  |
|  | split records if multiple allelic primitives (gaps or mismatches) are specified in a single VCF record |
|  | summarizes genotype counts for bi-allelic SNVs and |
|  | likelihood ratio test for haplotype lengths |
|  |  |
|  | integrated ratio of haplotype decay between reference and non-reference allele |
|  | compute vFst as a measure of CNV stratification |
|  | compute a pseudo-ROC curve |
|  | plot extended haplotype homozygosity (EHH) curves |
|  | provide output for haplotype plots |
|  | population genetic statistics for each SNP |
|  | random sampling |
|  |  |
|  | check integrity and identity against reference genome |
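One row of the table describes inserting AC and NS fields from sample genotypes. As a rough illustration of what such a computation involves, the Python sketch below derives AC (allele count per ALT allele) and NS (number of samples with data) from a list of GT strings. It is not vcflib's implementation: the function name `ac_ns` is hypothetical, and treating a sample as "with data" when it has at least one called allele is an assumption made for this example.

```python
def ac_ns(alt_alleles, genotypes):
    """Return (AC, NS) for a single site.

    alt_alleles: ALT alleles in VCF order, e.g. ["G", "T"]
    genotypes:   one GT string per sample, e.g. ["0/1", "1|2", "./."]

    Assumption for this sketch: a sample counts towards NS when at
    least one of its alleles is called (not ".").
    """
    ac = [0] * len(alt_alleles)
    ns = 0
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") genotypes alike.
        calls = [a for a in gt.replace("|", "/").split("/") if a != "."]
        if calls:
            ns += 1
        for allele in calls:
            index = int(allele)
            if index > 0:  # 0 is the reference allele
                ac[index - 1] += 1
    return ac, ns


ac, ns = ac_ns(["G", "T"], ["0/1", "1|2", "./.", "0/0"])
print(f"AC={','.join(map(str, ac))};NS={ns}")
```

The result is formatted the way it would appear in an INFO column, with one AC value per ALT allele in the order the alleles are listed.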
Fig 2. Example of the VCF format and a VCF transformation to JavaScript Object Notation (JSON) using bio-vcf.
(a) The line-based VCF record uses separators to split tab-delimited fields into subfields. Subfields are split on characters such as `,`, `=`, `:`, `;` and `/`. This splitting effectively projects a 'tree-like' data structure that can also be represented as (b) a JSON record. JSON is a common data exchange format for databases and web services. This example was generated with (c) the bio-vcf tool using a template [24]. bio-vcf can transform data to any textual format, including RDF, HTML and XML. See also the bio-vcf section.
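The projection from a flat VCF line to a tree can be sketched in a few lines of Python. This is an illustrative parser only, not bio-vcf (which is driven by templates [24]); the function name `vcf_line_to_dict`, the output key names, and the sample names are assumptions chosen for the example, and header-defined types are ignored, so all INFO and sample values stay strings.

```python
import json


def vcf_line_to_dict(line, sample_names):
    """Split one VCF data line into a nested dict (hypothetical layout)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),            # multiple ALT alleles: ","
        "qual": None if qual == "." else float(qual),
        "filter": flt.split(";"),
        "info": {},
        "samples": {},
    }
    if info != ".":
        for item in info.split(";"):      # INFO entries: ";"
            key, sep, value = item.partition("=")
            # Flags have no "=", so store them as True.
            record["info"][key] = value.split(",") if sep else True
    if len(fields) > 9:
        keys = fields[8].split(":")       # FORMAT keys: ":"
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
        "NS=3;DP=14;AF=0.5;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8")
print(json.dumps(vcf_line_to_dict(line, ["s1", "s2"]), indent=2))
```

Each separator adds one level of nesting, which is why the same record maps naturally onto JSON objects and arrays.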
Fig 3. Smoothed pFst (−log10) statistic with color-coded number of variants in a window.
As computed by vcflib’s pFst and smoother tools [17].
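Averaging a per-variant score over a sliding genomic window, together with the number of variants per window, can be sketched as follows. This is a generic illustration in Python, not the algorithm used by vcflib's smoother; the function name `smooth_scores` and the `window` and `step` parameters are assumptions for this example, and the scores are taken to be already transformed (for instance to −log10).

```python
def smooth_scores(positions, scores, window, step):
    """Average scores in half-open windows [start, start + window).

    Returns a list of (start, end, n_variants, mean_score) tuples;
    windows that contain no variants are skipped.
    """
    points = sorted(zip(positions, scores))
    if not points:
        return []
    results = []
    start = points[0][0]
    last = points[-1][0]
    while start <= last:
        end = start + window
        # Simple scan per window; adequate for a sketch, not for
        # genome-scale input.
        in_window = [s for p, s in points if start <= p < end]
        if in_window:
            mean = sum(in_window) / len(in_window)
            results.append((start, end, len(in_window), mean))
        start += step
    return results


for row in smooth_scores([100, 150, 300, 1200], [1.0, 3.0, 2.0, 4.0],
                         window=1000, step=500):
    print(row)
```

The per-window variant count is returned alongside the mean so that a plot can color each window by how many variants support it.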