Daniel E Cook1,2, Erik C Andersen2. 1. Interdisciplinary Biological Sciences Program, Northwestern University, Evanston, IL, USA. 2. Department of Molecular Biosciences, Northwestern University, Evanston, IL, USA.
Abstract
SUMMARY: The variant call format (VCF) is a popular standard for storing genetic variation data. As a result, a large collection of tools has been developed that perform diverse analyses using VCF files. However, tools for some tasks common to statistical and population geneticists have not yet been developed. To streamline these types of analyses, we created novel tools that analyze or annotate VCF files and organized these tools into a command-line based utility named VCF-kit. VCF-kit adds essential utilities to process and analyze VCF files, including primer generation for variant validation, dendrogram production, genotype imputation from sequence data in linkage studies, and additional tools. AVAILABILITY AND IMPLEMENTATION: https://github.com/AndersenLab/VCF-kit. CONTACT: erik.andersen@northwestern.edu.
1 Introduction
Population and quantitative genetics investigate how individuals within a population differ. The identification of these differences enables a variety of analyses to be performed. For example, genetic variation can be used to identify the basis of phenotypes, to answer evolutionary questions, or to facilitate forensics. The development of the variant call format (VCF) (Danecek ) as a standard for representing genetic variation across a group of individuals or samples has fueled the development of a large number of tools for genetic analyses using variant data. Examples include genotype imputation (Browning and Browning, 2007), variant annotation (Pedersen ), variant prediction (Cingolani ) and population genetic analysis (Pfeifer ). Despite the available tools, we found our studies required a unique set of tools that could interface directly with a VCF file. As a result, we have assembled a collection of tools into a command-line based program written in Python, VCF-kit, which functions directly on VCF files and performs a variety of unique analyses.
2 Implementation and design
VCF-kit is invoked using a command-line interface written in Python and requires a Unix-based operating system. Additionally, VCF-kit requires BWA (Li and Durbin, 2009), BLAST (Altschul ), MUSCLE (Edgar, 2004), Samtools (Li ), bcftools and PRIMER3 (Untergasser ) for certain types of analyses.
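Because several analyses depend on external programs, it can help to confirm that they are installed before running a pipeline. The following is a minimal sketch of such a check, not part of VCF-kit itself. The executable names in `ASSUMED_EXECUTABLES` are the names these programs commonly install under and are assumptions; they may differ on a given system.

```python
import shutil

# Assumed executable names for the dependencies listed above.
# These are common install names, not values taken from VCF-kit.
ASSUMED_EXECUTABLES = [
    "bwa",
    "blastn",
    "muscle",
    "samtools",
    "bcftools",
    "primer3_core",
]


def find_missing_tools(tools=ASSUMED_EXECUTABLES):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All assumed dependencies were found.")
```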
3 Usage
3.1 Reference genome management
Several of the utilities included in VCF-kit require a reference genome that has been either indexed by BWA (Li and Durbin, 2009) and Samtools (Li and Durbin, 2009) or used to generate a BLAST database. The genome command retrieves and processes reference genomes from the National Center for Biotechnology Information (NCBI) genomes database (Pruitt and Maglott, 2001). Reference genomes are processed with other available utilities.
3.2 Phylogenetic tree generation
VCF-kit can be used to produce a tree from a VCF using the phylo command. Variants are concatenated from each sample and then combined into a single FASTA file, where each line represents one sample. This file effectively represents a multiple sequence alignment that only incorporates variable sites from the VCF samples and can be used to calculate a difference matrix using MUSCLE (Edgar, 2004). MUSCLE outputs the tree in the Newick format, which the user may plot.
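The concatenation step described above can be illustrated with a short sketch. This is not the phylo command's implementation. The function name `variants_to_fasta`, the input layout (one dictionary per variant site mapping sample name to base) and the use of `N` for missing genotypes are all assumptions made for illustration.

```python
def variants_to_fasta(samples, sites, missing="N"):
    """Concatenate per-site bases into one FASTA record per sample.

    samples: list of sample names, in output order.
    sites:   list of dicts, one per variant site, mapping sample -> base.
    missing: character used when a sample has no call at a site (assumed).
    """
    lines = []
    for sample in samples:
        sequence = "".join(site.get(sample, missing) for site in sites)
        lines.append(f">{sample}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Hypothetical example: three variant sites, two samples.
example_sites = [
    {"strainA": "G", "strainB": "T"},
    {"strainA": "C"},  # strainB has no call at this site
    {"strainA": "A", "strainB": "A"},
]
print(variants_to_fasta(["strainA", "strainB"], example_sites), end="")
```

Every sequence has the same length, one column per variant site, which is why the file can be treated as an alignment restricted to variable sites.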
3.3 NIL and RIL calling from low coverage sequence data
Recombinant inbred lines (RILs) and near-isogenic lines (NILs) are powerful tools for understanding quantitative traits. RILs are a mixture of genotypes from two or more strains generated by various crossing schemes. NILs are nearly identical to a parental strain except for a single region that has been introgressed from the other parental genome. Both RILs and NILs are used for quantitative genetic mappings. However, genotyping all markers in many RIL or NIL strains can be expensive.
To save on genotyping costs, researchers can barcode and mix samples for pooled sequencing to generate low-coverage genotype data. Low-coverage sequencing, alignment and variant calling produce VCF files with sparsely defined genotypes. If the parental strains are sequenced at higher coverage, they can be used to impute the missing genotypes of RIL and NIL strains using a hidden Markov model (HMM). Because recombination events are rare between tightly linked alleles, parental genotypes can be imputed from linked alleles in RIL and NIL strains. These linked alleles might contain errors because of the nature of low-coverage sequence data. An HMM can be used to identify stretches of genotype calls and infer missing genotypes to classify regions of RIL and NIL genomes inherited from a specific parental strain.
The hmm command implements the process described above. A VCF with high-quality genotype data from at least one parent and the genotypes from a population of RIL or NIL samples must be supplied. We manually set the parameters of the HMM based on the level of concordance observed with existing genotype data. Documentation includes examples detailing how to plot genotypes, which is useful to assess imputation results.
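To make the idea concrete, the following is a minimal two-state Viterbi sketch for a single chromosome of one RIL or NIL sample. It is an illustration under stated assumptions, not the hmm command's implementation: the function name `viterbi_impute`, the encoding of observations (0 for the first parent's allele, 1 for the second parent's allele, `None` for a missing call) and the default `switch_prob` and `error_prob` values are all hypothetical.

```python
import math


def viterbi_impute(observations, switch_prob=1e-3, error_prob=0.05):
    """Return the most likely parental state (0 or 1) at every site.

    observations: list of 0, 1 or None (missing), ordered along a chromosome.
    switch_prob:  assumed probability of a recombination between adjacent sites.
    error_prob:   assumed probability that an observed call is wrong.
    """
    if not observations:
        return []
    states = (0, 1)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob)

    def log_emit(state, obs):
        if obs is None:  # a missing call is uninformative
            return 0.0
        return math.log(1.0 - error_prob) if obs == state else math.log(error_prob)

    # Initialise with equal prior probability for both parents.
    score = [math.log(0.5) + log_emit(s, observations[0]) for s in states]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in states:
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in states
            ]
            best_prev = max(states, key=lambda p: candidates[p])
            new_score.append(candidates[best_prev] + log_emit(s, obs))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical sparse calls: one likely error (index 4) and two missing sites.
calls = [0, 0, None, 0, 1, 0, 0, 1, 1, None, 1, 1]
print(viterbi_impute(calls))
```

With these assumed parameters, a single discordant call is cheaper to explain as a genotyping error than as two recombination events, so it is absorbed into the surrounding parental block, while missing sites take the state of their neighbors.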
3.4 Generating primers for variant validation
The primer command generates primer sequences to validate variants using Sanger sequencing and to genotype restriction fragment length polymorphisms (RFLP) or insertion/deletion (indel) variants. When invoked, sequences flanking the desired variant from the reference genome are retrieved. Generated primers are filtered if they target multiple locations in the reference genome as determined by BLAST (Altschul ).
When using the primer command for Sanger sequencing validation, a pair of primers is generated for PCR template amplification of the region with the variant. The left primer can also be used to initiate sequencing. For RFLP genotyping, the primer command calculates the expected product sizes given the restriction enzyme and size of PCR amplification product. Users are provided with the product sizes for each restriction fragment, primer sequences, restriction site locations and required restriction enzymes. Finally, primers and product sizes in the presence or absence of an indel variant are output for indel genotyping.
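The product-size calculation for RFLP genotyping amounts to splitting the amplicon at each cut position. The sketch below shows that arithmetic only. The function name `digest_fragment_sizes` and the convention that a cut position is the number of bases to the left of the cut are assumptions, and the amplicon length and cut positions in the example are made up.

```python
def digest_fragment_sizes(amplicon_length, cut_positions):
    """Return fragment sizes after cutting a linear PCR product.

    amplicon_length: length of the PCR product in base pairs.
    cut_positions:   offsets from the amplicon start, each giving the number
                     of bases to the left of a cut (assumed convention).
                     Positions outside the amplicon are ignored.
    """
    cuts = sorted({p for p in cut_positions if 0 < p < amplicon_length})
    bounds = [0] + cuts + [amplicon_length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical 500 bp amplicon. One allele carries two restriction sites;
# the other allele lacks the site at offset 350.
print(digest_fragment_sizes(500, [120, 350]))
print(digest_fragment_sizes(500, [120]))
```

Comparing the two fragment lists shows which band sizes would distinguish the alleles on a gel.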
3.5 Call variants from Sanger sequences
VCF-kit provides the call command for comparing single-nucleotide variants (SNVs) within a VCF against Sanger sequencing to verify variants. Users should take care when using the call command, as it is not a substitute for the manual examination of chromatograms to validate variants. The call command takes a FASTA, FASTQ, or AB1 file with Sanger sequences annotated by sample and a VCF file as input. Sequences are compared by BLAST (Altschul ) against the specified reference genome and the genotypes corresponding to variant positions within the VCF are output. If the input sequence data are annotated with sample names, output variants can be classified as true positives, true negatives, false positives, or false negatives as compared with Sanger sequencing results.
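The four-way classification can be written down explicitly. The sketch below treats the Sanger result as the truth and the VCF genotype as the prediction, which is an assumption about how the categories are defined. The function names and the boolean inputs are hypothetical and do not reflect the call command's output format.

```python
from collections import Counter


def classify_call(vcf_has_variant, sanger_has_variant):
    """Classify one site in one sample, treating Sanger as the truth (assumed)."""
    if vcf_has_variant and sanger_has_variant:
        return "true positive"
    if not vcf_has_variant and not sanger_has_variant:
        return "true negative"
    if vcf_has_variant:
        return "false positive"
    return "false negative"


def summarize(calls):
    """Count each category for an iterable of (vcf, sanger) boolean pairs."""
    return Counter(classify_call(vcf, sanger) for vcf, sanger in calls)


# Hypothetical comparison of four sites.
print(summarize([(True, True), (True, False), (False, False), (False, True)]))
```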
3.6 Additional tools
The rename command can be used to prepend, append, or substitute strings on sample names. The vcf2tsv command can convert a SnpEff annotated VCF to a TSV. The calc command can be used to count the number of homozygous variants per sample shared with other samples (i.e. the number of singletons, doubletons, tripletons, etc. per sample). VCF-kit documentation features the full list of tools and subcommands.
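As an illustration of the per-sample counting that the calc command is described as performing, the sketch below tallies, for each sample, how many of its homozygous alternate genotypes are carried by exactly one sample (singletons), two samples (doubletons), and so on. The function name `sharing_spectrum`, the input layout and the genotype strings are assumptions, and only biallelic homozygous alternate calls (`1/1` or `1|1`) are counted.

```python
from collections import Counter, defaultdict

HOMOZYGOUS_ALT = {"1/1", "1|1"}  # assumed genotype encoding


def sharing_spectrum(sites):
    """Count, per sample, variants by the number of samples that carry them.

    sites: iterable of dicts, one per variant site, mapping sample -> genotype.
    Returns {sample: {n_carriers: n_sites}}.
    """
    counts = defaultdict(Counter)
    for genotypes in sites:
        carriers = [s for s, gt in genotypes.items() if gt in HOMOZYGOUS_ALT]
        for sample in carriers:
            counts[sample][len(carriers)] += 1
    return {sample: dict(spectrum) for sample, spectrum in counts.items()}


# Hypothetical genotypes at three sites for three samples.
example = [
    {"A": "1/1", "B": "0/0", "C": "0/0"},  # singleton in A
    {"A": "1/1", "B": "1/1", "C": "0/0"},  # doubleton shared by A and B
    {"A": "0/0", "B": "1/1", "C": "./."},  # singleton in B
]
print(sharing_spectrum(example))
```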
4 Conclusion
VCF-kit was developed to centralize a collection of tools and scripts we have developed to streamline analyses of genetic variation. VCF-kit is open-source software. We welcome community contributions and feedback. Documentation is available at vcf-kit.readthedocs.io.
Funding
National Institutes of Health [R01GM107227] and American Cancer Society Research Scholar Award to E.C.A.; The National Science Foundation Graduate Research Fellowship [DGE-1324585] to D.E.C.
Conflict of Interest: none declared.
Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937