VCF-kit: assorted utilities for the variant call format.

Daniel E Cook, Erik C Andersen.

Abstract

SUMMARY: The variant call format (VCF) is a popular standard for storing genetic variation data. As a result, a large collection of tools has been developed to perform diverse analyses using VCF files. However, tools for some tasks common to statistical and population geneticists have not yet been created. To streamline these analyses, we created novel tools that analyze or annotate VCF files and organized them into a command-line utility named VCF-kit. VCF-kit adds essential utilities to process and analyze VCF files, including primer generation for variant validation, dendrogram production, genotype imputation from sequence data in linkage studies, and additional tools.
AVAILABILITY AND IMPLEMENTATION: https://github.com/AndersenLab/VCF-kit. CONTACT: erik.andersen@northwestern.edu.
© The Author 2017. Published by Oxford University Press.

Year:  2017        PMID: 28093408      PMCID: PMC5423453          DOI: 10.1093/bioinformatics/btx011

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Population and quantitative genetics investigate how individuals within a population differ. Identifying these differences enables a variety of analyses. For example, genetic variation can be used to identify the basis of phenotypes, to answer evolutionary questions, or to facilitate forensics. The development of the variant call format (VCF) (Danecek et al., 2011) as a standard for representing genetic variation across a group of individuals or samples has fueled the development of a large number of tools for genetic analyses using variant data. Examples include genotype imputation (Browning and Browning, 2007), variant annotation (Pedersen et al., 2016), variant prediction (Cingolani et al., 2012) and population genetic analysis (Pfeifer et al., 2014). Despite the available tools, we found that our studies required a unique set of tools that could interface directly with a VCF file. As a result, we have assembled a collection of tools into a command-line program written in Python, VCF-kit, which operates directly on VCF files and performs a variety of unique analyses.

2 Implementation and design

VCF-kit is invoked using a command-line interface written in Python and requires a Unix-based operating system. Additionally, VCF-kit requires BWA (Li and Durbin, 2009), BLAST (Altschul et al., 1990), MUSCLE (Edgar, 2004), Samtools (Li et al., 2009), bcftools and PRIMER3 (Untergasser et al., 2012) for certain types of analyses.

3 Usage

3.1 Reference genome management

Several of the utilities included in VCF-kit require a reference genome that has been either indexed by BWA (Li and Durbin, 2009) and Samtools (Li et al., 2009) or used to generate a BLAST database. The genome command retrieves and processes reference genomes from the National Center for Biotechnology Information (NCBI) genomes database (Pruitt and Maglott, 2001). Reference genomes are processed with other available utilities.

3.2 Phylogenetic tree generation

VCF-kit can be used to produce a tree from a VCF using the phylo command. Variants are concatenated from each sample and then combined into a single FASTA file, where each line represents one sample. This file effectively represents a multiple sequence alignment that only incorporates variable sites from the VCF samples and can be used to calculate a difference matrix using MUSCLE (Edgar, 2004). MUSCLE outputs the tree in the Newick format, which the user may plot.

3.3 NIL and RIL calling from low coverage sequence data

Recombinant inbred lines (RILs) and near-isogenic lines (NILs) are powerful tools for understanding quantitative traits. RILs are a mixture of genotypes from two or more strains generated by various crossing schemes. NILs are nearly identical to a parental strain except for a single region that has been introgressed from the other parental genome. Both RILs and NILs are used for quantitative genetic mappings. However, genotyping all markers in many RIL or NIL strains can be expensive. To save on genotyping costs, researchers can barcode and mix samples for pooled sequencing to generate low-coverage genotype data. Low-coverage sequencing, alignment and variant calling produce VCF files with sparsely defined genotypes. If the parental strains are sequenced at higher coverage, they can be used to impute the missing genotypes of RIL and NIL strains using a hidden Markov model (HMM). Because recombination events are rare between tightly linked alleles, parental genotypes can be imputed from linked alleles in RIL and NIL strains. These linked alleles might contain errors because of the nature of low-coverage sequence data. An HMM can be used to identify stretches of genotype calls and infer missing genotypes to classify regions of RIL and NIL genomes inherited from a specific parental strain.

The hmm command implements the process described above. A VCF with high-quality genotype data from at least one parent and the genotypes from a population of RIL or NIL samples must be supplied. We manually set the parameters of the HMM based on the level of concordance observed with existing genotype data. Documentation includes examples detailing how to plot genotypes, which is useful for assessing imputation results.
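As a rough illustration of how an HMM can smooth sparse, error-prone calls into parental blocks, the sketch below runs the Viterbi algorithm over a two-state model (parent 0 or parent 1) along one chromosome. It is a minimal sketch and not the hmm command's implementation: the function name `viterbi_parent_calls` and the `switch` and `error` values are assumptions chosen for the example, not the parameters used by VCF-kit.

```python
import math

# Hypothetical two-state HMM: hidden state = parent of origin (0 or 1).
# Observations are per-site calls: 0, 1, or None for a missing genotype.
# `switch` (per-site recombination probability) and `error` (genotyping
# error rate) are illustrative values, not VCF-kit defaults.

def viterbi_parent_calls(obs, switch=0.01, error=0.1):
    """Return the most likely parent-of-origin path for a list of calls."""
    if not obs:
        return []
    states = (0, 1)

    def emit(state, call):
        if call is None:          # missing data is uninformative
            return 0.0
        return math.log(1 - error if call == state else error)

    log_stay, log_switch = math.log(1 - switch), math.log(switch)
    score = [math.log(0.5) + emit(s, obs[0]) for s in states]
    back = []                     # back[t][s] = best predecessor of state s
    for call in obs[1:]:
        new_score, pointers = [], []
        for s in states:
            cands = [score[p] + (log_stay if p == s else log_switch)
                     for p in states]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + emit(s, call))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


calls = [0, 0, None, 0, 1, 0, 0, 1, 1, None, 1, 1]
imputed = viterbi_parent_calls(calls)
print(imputed)
```

In this toy input, the isolated `1` at the fifth site is treated as a genotyping error rather than a double recombination, and the two missing calls are filled in from their linked neighbours.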

3.4 Generating primers for variant validation

The primer command generates primer sequences to validate variants using Sanger sequencing and to genotype restriction fragment length polymorphisms (RFLP) or insertion/deletion (indel) variants. When invoked, sequences flanking the desired variant from the reference genome are retrieved. Generated primers are filtered out if they target multiple locations in the reference genome as determined by BLAST (Altschul et al., 1990). When using the primer command for Sanger sequencing validation, a pair of primers is generated for PCR template amplification of the region with the variant. The left primer can also be used to initiate sequencing. For RFLP genotyping, the primer command calculates the expected product sizes given the restriction enzyme and size of PCR amplification product. Users are provided with the product sizes for each restriction fragment, primer sequences, restriction site locations and required restriction enzymes. Finally, primers and product sizes in the presence or absence of an indel variant are output for indel genotyping.
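The product-size calculation for RFLP genotyping can be sketched as follows. This is an assumed, simplified version of the idea and not the primer command's code: the helper names `find_cut_positions` and `fragment_sizes` are hypothetical, only the top strand is searched, and the toy amplicon is invented for the example. EcoRI (recognition site GAATTC, cutting after the first G) is used as the enzyme.

```python
# Hypothetical helpers for predicting restriction fragment sizes of a
# linear PCR product. Only exact, top-strand matches are considered.

def find_cut_positions(seq, site, cut_offset):
    """Return cut coordinates (0-based offsets into seq) for every
    occurrence of the recognition site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_sizes(amplicon_length, cut_positions):
    """Return fragment lengths, in order, after cutting at each position."""
    cuts = sorted(p for p in set(cut_positions) if 0 < p < amplicon_length)
    edges = [0] + cuts + [amplicon_length]
    return [b - a for a, b in zip(edges, edges[1:])]


# Toy amplicons: the reference allele carries an EcoRI site (G^AATTC),
# the alternate allele destroys it with a single-base change.
ref_amplicon = "ACGT" * 5 + "GAATTC" + "ACGT" * 10
alt_amplicon = "ACGT" * 5 + "GAACTC" + "ACGT" * 10

ref_fragments = fragment_sizes(
    len(ref_amplicon), find_cut_positions(ref_amplicon, "GAATTC", 1))
alt_fragments = fragment_sizes(
    len(alt_amplicon), find_cut_positions(alt_amplicon, "GAATTC", 1))
print(ref_fragments, alt_fragments)
```

The two alleles give different banding patterns on a gel (two fragments versus one uncut product), which is what makes the variant scorable by RFLP.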

3.5 Call variants from Sanger sequences

VCF-kit provides the call command for comparing SNVs within a VCF against Sanger sequencing to verify variants. Users should take care when using the call command, as it is not a substitute for the manual examination of chromatograms to validate variants. The call command takes a FASTA, FASTQ, or AB1 file with Sanger sequences annotated by sample and a VCF file as input. Sequences are compared by BLAST (Altschul et al., 1990) against the specified reference genome and the genotypes corresponding to variant positions within the VCF are output. If the input sequence data are annotated with sample names, output variants can be classified as true positives, true negatives, false positives, or false negatives as compared with Sanger sequencing results.
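The four-way classification can be made concrete with a short sketch. The convention used here is an assumption for illustration (the Sanger result is treated as the truth and the VCF genotype as the prediction), and the function name `classify_call` and the example positions are hypothetical rather than output of the call command.

```python
from collections import Counter

# Assumed convention: Sanger sequencing is the reference standard and the
# VCF genotype is the call being evaluated.

def classify_call(vcf_has_alt, sanger_has_alt):
    """Classify one sample/site comparison as TP, FP, FN or TN."""
    if vcf_has_alt and sanger_has_alt:
        return "TP"   # variant called and confirmed
    if vcf_has_alt:
        return "FP"   # variant called but not seen by Sanger
    if sanger_has_alt:
        return "FN"   # variant missed by the VCF
    return "TN"       # both agree on the reference allele


# (position, VCF carries alt allele, Sanger carries alt allele)
comparisons = [
    (1050, True, True),
    (2210, True, False),
    (3475, False, False),
    (4802, False, True),
    (5120, True, True),
]
labels = [classify_call(vcf, sanger) for _, vcf, sanger in comparisons]
summary = Counter(labels)
print(summary)
```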

3.6 Additional tools

The rename command can be used to prepend, append, or substitute strings on sample names. The vcf2tsv command can convert a SnpEff annotated VCF to a TSV. The calc command can be used to count the number of homozygous variants per sample shared with other samples (i.e. the number of singletons, doubletons, tripletons, etc. per sample). VCF-kit documentation features the full list of tools and subcommands.
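The per-sample counting of singletons, doubletons and so on can be sketched as below. This is one possible reading of the description, not the calc command's code: the function name `sharing_spectrum` and the genotype encoding (1 for homozygous alternate, 0 otherwise) are assumptions.

```python
from collections import Counter

# Hypothetical sketch: for each sample, count how many of its homozygous
# alternate variants are carried by exactly k samples in total
# (k = 1 singleton, k = 2 doubleton, ...).

def sharing_spectrum(samples, sites):
    """sites: list of per-site genotype lists (1 = homozygous alternate),
    ordered like `samples`. Returns {sample: Counter({k: n_variants})}."""
    spectrum = {sample: Counter() for sample in samples}
    for genotypes in sites:
        carriers = [s for s, gt in zip(samples, genotypes) if gt == 1]
        for sample in carriers:
            spectrum[sample][len(carriers)] += 1
    return spectrum


samples = ["A", "B", "C"]
sites = [
    [1, 0, 0],   # singleton in A
    [1, 1, 0],   # doubleton shared by A and B
    [0, 1, 0],   # singleton in B
    [1, 1, 1],   # carried by all three samples
]
spectrum = sharing_spectrum(samples, sites)
print(spectrum)
```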

4 Conclusion

VCF-kit was developed to centralize a collection of tools and scripts we have developed to streamline analyses of genetic variation. VCF-kit is open-source software. We welcome community contributions and feedback. Documentation is available at vcf-kit.readthedocs.io.

Funding

National Institutes of Health [R01GM107227] and American Cancer Society Research Scholar Award to E.C.A.; The National Science Foundation Graduate Research Fellowship [DGE-1324585] to D.E.C.

Conflict of Interest: none declared.

1.  RefSeq and LocusLink: NCBI gene-centered resources.

Authors:  K D Pruitt; D R Maglott
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors:  Robert C Edgar
Journal:  Nucleic Acids Res       Date:  2004-03-19       Impact factor: 16.971

3.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

4.  A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors:  Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal:  Fly (Austin)       Date:  2012 Apr-Jun       Impact factor: 2.160

5.  Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Authors:  Sharon R Browning; Brian L Browning
Journal:  Am J Hum Genet       Date:  2007-09-21       Impact factor: 11.025

6.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

7.  Primer3--new capabilities and interfaces.

Authors:  Andreas Untergasser; Ioana Cutcutache; Triinu Koressaar; Jian Ye; Brant C Faircloth; Maido Remm; Steven G Rozen
Journal:  Nucleic Acids Res       Date:  2012-06-22       Impact factor: 16.971

8.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

9.  PopGenome: an efficient Swiss army knife for population genomic analyses in R.

Authors:  Bastian Pfeifer; Ulrich Wittelsbürger; Sebastian E Ramos-Onsins; Martin J Lercher
Journal:  Mol Biol Evol       Date:  2014-04-16       Impact factor: 16.240

10.  Vcfanno: fast, flexible annotation of genetic variants.

Authors:  Brent S Pedersen; Ryan M Layer; Aaron R Quinlan
Journal:  Genome Biol       Date:  2016-06-01       Impact factor: 13.583
