
VarB: a variation browsing and analysis tool for variants derived from next-generation sequencing data.

Mark D Preston, Magnus Manske, Neil Horner, Samuel Assefa, Susana Campino, Sarah Auburn, Issaka Zongo, Jean-Bosco Ouedraogo, Francois Nosten, Tim Anderson, Taane G Clark.

Abstract

SUMMARY: There is an immediate need for tools to both analyse and visualize, in real time, single-nucleotide polymorphisms, insertions and deletions, and other structural variants from new sequence file formats. We have developed VarB software that can be used to visualize variant call format files in real time, as well as to identify regions under balancing selection and informative markers that differentiate user-defined groups (e.g. populations). We demonstrate its utility using sequence data from 50 Plasmodium falciparum isolates from two continents and confirm known signals from genomic regions that contain important antigenic and anti-malarial drug-resistance genes.
AVAILABILITY AND IMPLEMENTATION: The C++-based software VarB and user manual are available from www.pathogenseq.org/varb. CONTACT: taane.clark@lshtm.ac.uk

Year:  2012        PMID: 22976080      PMCID: PMC3496337          DOI: 10.1093/bioinformatics/bts557

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Massively parallel sequencing (MPS) technologies are providing whole-genome data on organisms that cause or have disease. These data are being used to catalogue genomic diversity and, in the context of human research, to inform genome-wide and fine-scale mapping projects. The technologies can sequence a small number of human genomes in a single run, making high-throughput sequencing of human genomes technically possible. Pathogen genomes are much smaller, which makes them tractable for large genome diversity studies and enables their evolution to be tracked over time and space. They are also more amenable to whole-genome association studies, leading to the identification of variants associated with phenotypes such as drug resistance. To take full advantage of the genetic variation exposed by whole-genome sequencing across pathogen (and other) organisms, the development of interactive analysis and real-time visualization tools is essential. The variant call format (VCF) has become the recognized format for listing genomic variants, including single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels), usually derived from processing alignment files. VCF was developed by the 1000 Genomes project and has been adopted by large-scale genome projects [e.g. UK10K, dbSNP (Danecek et al., 2011)]. It facilitates multi-sample curation and identification of SNPs, short indels and other types of structural variants, together with sample meta-data. The software suite VCFtools implements utilities for processing VCF files (Danecek et al., 2011). Software to visualize VCF files is available (Carver et al., 2012; Fiume et al., 2010; Thorvaldsdóttir et al., 2012), but these tools perform limited or no population and statistical genetic analyses in real time directly from multi-sample files. New tools are urgently required because of the increasing use of MPS technologies in genomic epidemiological studies, and the need to rapidly translate sequence variation into further experiments.
This need has motivated the development of an all-in-one variant browsing and analysis tool, VarB.
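As a rough illustration of the kind of record a VCF browser has to read, the sketch below extracts the sample names from the `#CHROM` header line and the GT (genotype) field from one tab-separated data line. The function names (`parse_sample_names`, `parse_record`), the `Variant` container and the example record are hypothetical and chosen for illustration only; this is not VarB's parser, and it ignores meta-information lines, INFO parsing and most of the VCF specification.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# One genotype call: allele indices (0 = reference), None for a missing allele.
Genotype = Tuple[Optional[int], ...]


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    qual: Optional[float]
    genotypes: Dict[str, Genotype]


def parse_sample_names(header_line: str) -> List[str]:
    """Return the sample names from the '#CHROM ...' header line."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("expected the '#CHROM' header line")
    # Columns 0-8 are fixed (CHROM .. FORMAT); samples start at column 9.
    return fields[9:]


def parse_record(line: str, samples: List[str]) -> Variant:
    """Parse one data line, keeping only the GT sub-field of each sample."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual = fields[:6]
    gt_index = fields[8].split(":").index("GT")
    genotypes: Dict[str, Genotype] = {}
    for name, column in zip(samples, fields[9:]):
        gt = column.split(":")[gt_index]
        # '/' separates unphased alleles and '|' phased ones; treat both alike.
        alleles = gt.replace("|", "/").split("/")
        genotypes[name] = tuple(None if a == "." else int(a) for a in alleles)
    return Variant(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=alt.split(","),
        qual=None if qual == "." else float(qual),
        genotypes=genotypes,
    )


# Made-up two-sample example (not taken from the study data).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB"
record = "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:DP\t0/0:12\t1|1:9"
samples = parse_sample_names(header)
variant = parse_record(record, samples)
```

Keeping only GT keeps the sketch short; a viewer that filters on read depth or quality, as described for VarB, would also need the per-sample sub-fields such as DP.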

2 FEATURES OF VarB

VarB is a standalone C++ software tool that visualizes (un)phased polymorphisms in a VCF file by sample, genetic region and quality. The basic inputs are a reference genome (fasta), variant (VCF) and annotation (gff) files. Complete genomes or user-specified regions (e.g. a chromosome) may be loaded and viewed, with variant genotypes colour coded. The variants displayed, and their number, change with the quality and read-depth filters selected, allowing users to assess the robustness of the data and analysis. Sequence data with genes, exons, coding regions and strand-dependent codons are marked at appropriate zoom levels, and tracks summarizing GC content, relative variant density and results from data analysis are presented. The Tajima's D metric (Tajima, 1989) is implemented; it distinguishes between a DNA sequence evolving randomly ('neutrally', values close to 0) and one evolving under a non-random process, including directional selection (low negative values) or balancing selection (high positive values). The population differentiation measure Fst is also implemented (Weir, 1996) and quantifies allele frequency differences between user-specified populations, with values ranging between 0 (no difference) and 1 (complete differentiation). The graphics and analysis outputs can be exported. VarB was developed using the Qt cross-platform application and user interface framework (qt-project.org).
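To make the interpretation of Tajima's D concrete, the following is a minimal sketch of the standard estimator from Tajima (1989), written in Python rather than VarB's C++. It assumes haploid, biallelic data coded 0 (reference) or 1 (alternative) with no missing calls; the function name `tajimas_d` and the toy inputs are hypothetical, and how VarB windows the genome or handles missing and mixed calls is not stated in the text.

```python
from math import nan, sqrt
from typing import Sequence


def tajimas_d(haplotypes: Sequence[Sequence[int]]) -> float:
    """Tajima's D for n haploid sequences coded 0/1 at each site.

    haplotypes[i][j] is the allele of sample i at site j.
    Returns nan when there are no segregating sites.
    """
    n = len(haplotypes)
    if n < 4:
        # The variance term of the statistic is zero for n < 4.
        raise ValueError("at least 4 sequences are required")

    # Alternative-allele count at each site (one column per site).
    alt_counts = [sum(column) for column in zip(*haplotypes)]
    segregating = [k for k in alt_counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        return nan

    # pi: mean number of pairwise differences, summed over sites.
    n_pairs = n * (n - 1) / 2
    pi = sum(k * (n - k) for k in segregating) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # D compares pi with Watterson's estimate S / a1.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy inputs: intermediate-frequency alleles versus singletons.
balanced = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
singletons = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
d_balanced = tajimas_d(balanced)
d_singletons = tajimas_d(singletons)
```

In the toy inputs, sites where the two alleles are at intermediate frequency give a positive D, while sites where the alternative allele occurs only once give a negative D, matching the balancing-selection and directional-selection readings described above.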

3 APPLICATION TO PLASMODIUM FALCIPARUM DATA

We demonstrate the functionality of VarB using P. falciparum (Pf) whole-genome (14 chromosomes, 23 Mb, ~81% AT content) sequence data from Burkina Faso (BF, n = 25) and Thailand (n = 25). The raw data are from a Pf genomic diversity study (Manske et al., 2012; SRA Study ERP000190). The Illumina 54/76-base paired reads were mapped to the 3D7 reference genome (v3.0) using smalt (www.sanger.ac.uk/smalt) and processed as described previously (Robinson et al., 2011) to construct VCF (v4.1) files consisting of SNPs and indels. Across the 50 samples, 46 283 SNPs (a density of about one every 500 bp) were identified and summarized in a combined VCF file. We focus on the 23 942 (51.7%) SNPs with minor alleles observed at least twice. Figure 1 (top) shows the SNP data for chromosome 10 (n = 1790 loci, 7.5%). Estimation of Tajima's D identified a large region (positioned at ~1.4 Mb) potentially under balancing selection. This region includes highly polymorphic antigenic-determining genes from the merozoite surface protein 3 family [msp3 (PF10_0345); msp3.8 (PF10_0355)], which are known to be under diversifying selection and are considered potential vaccine candidates (Ochola et al., 2010). By specifying the two population groups (BF and Thailand), the population differentiation measure Fst was calculated. There were 1153 (4.8%) SNPs with near or complete population differentiation (Fst values > 0.95) across the genome. Loci involved in anti-malarial drug resistance often show high differentiation (Wootton et al., 2002). Figure 1 (bottom) shows one region of interest in chromosome 7 containing the chloroquine resistance transporter (PfCRT, MAL7.1.27, maximum Fst 0.92). This locus is known to confer resistance to chloroquine-based anti-malarial drugs, with established differences in haplotype structure between Southeast Asian and African isolates (Wootton et al., 2002). Regions around the PfDHFR and PfDHPS genes, which carry polymorphisms known to confer sulfadoxine-pyrimethamine resistance (Pearce et al., 2009), also had high Fst values (>0.80) (data not shown).
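The two per-SNP quantities used above, the minor allele count and Fst between two groups, can be sketched as follows. This is an illustration only: it uses the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T for a biallelic SNP with haploid calls, which is not necessarily the Weir (1996) estimator that VarB implements, and the function names and the example counts are hypothetical.

```python
from math import nan


def minor_allele_count(alt_count: int, n_samples: int) -> int:
    """Number of copies of the rarer allele at a biallelic SNP."""
    return min(alt_count, n_samples - alt_count)


def fst_biallelic(alt1: int, n1: int, alt2: int, n2: int) -> float:
    """Fst = (H_T - H_S) / H_T for one biallelic SNP and two groups.

    alt1, alt2: alternative-allele counts in each group.
    n1, n2: number of (haploid) calls in each group.
    Returns nan when the SNP is monomorphic across both groups.
    """
    p1 = alt1 / n1
    p2 = alt2 / n2
    p_total = (alt1 + alt2) / (n1 + n2)

    # Expected heterozygosity in the pooled sample.
    h_total = 2 * p_total * (1 - p_total)
    if h_total == 0:
        return nan

    # Sample-size-weighted mean of the within-group heterozygosities.
    h_within = (n1 * 2 * p1 * (1 - p1) + n2 * 2 * p2 * (1 - p2)) / (n1 + n2)
    return (h_total - h_within) / h_total


# Hypothetical counts for two groups of 25 samples each.
fixed_difference = fst_biallelic(25, 25, 0, 25)   # alleles fully sorted by group
same_frequency = fst_biallelic(10, 25, 10, 25)    # identical frequencies
singleton = minor_allele_count(1, 50)             # would fail an "at least twice" filter
```

Under this definition, a SNP fixed for different alleles in the two groups gives Fst = 1 and a SNP with identical frequencies gives Fst = 0, the two ends of the 0 to 1 range described in Section 2.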
Fig. 1.

(Top) Chromosome 10 (MAL10): (A) loading of files (fasta, gff, VCF) and defining groups for Fst analysis; (B) colour coding of genotypes and alleles, and the setting of minimum quality and read depth; (C) variant and group inclusion for display; (D) position slider; (E) zoom slider; (F) annotation window; (G) display window where each row represents a different sample (BF1-25, Thai1-25) and variants are colour coded (see B); (H) GC content track; (I) variant density; (J) Tajima's D track, with a region of high values circled (including the PF10_0355 gene); (K) gene search tool; (L) gene information. (Bottom) Region of MAL7 (456 k–481 k): (A) inclusion of BF and Thai groups for analysis; (B) BF25 isolate information at position 461047 in the MAL7.1.27 gene, including a genotype call of 0/0 (reference allele); (C) Fst values for each SNP, with the highest value (0.92, circled, see B).

4 DISCUSSION

The translation of sequence variation into further laboratory experiments, treatments and point-of-care interventions requires the ability to interrogate genomic data. VarB processes VCF files, displays the variants by position and quality, and compares them between groups to establish informative genetic markers and regions under selection. An advantage of the software is that it performs population and statistical analysis in real time and does not require calculations to be performed elsewhere. We have presented data from 50 Pf isolates to demonstrate the utility of the software, and highlighted regions with differing allele frequencies that coincide with known drug-resistance loci (e.g. PfCRT), as well as known vaccine candidates found by considering regions under balancing selection (e.g. msp3.8). It is possible to process many hundreds of Pf or smaller genomes (e.g. bacterial) simultaneously. To aid the processing of numerous, much larger genomes, such as human, the software can read in single chromosomes. The modular computing architecture provides the flexibility to incorporate a number of extensions. These include the capacity to process BCF/BCF2 files, reading in informative meta tracks (e.g. genomic uniqueness), calculating other population genetic statistics and performing tests of association with a phenotype. Increased utility will also be possible through updates to the VCF format that allow variants of greater size, such as large deletions, to be identified and annotated.
REFERENCES

1.  Savant: genome browser for high-throughput sequencing data.

Authors:  Marc Fiume; Vanessa Williams; Andrew Brook; Michael Brudno
Journal:  Bioinformatics       Date:  2010-06-20       Impact factor: 6.937

2.  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors:  F Tajima
Journal:  Genetics       Date:  1989-11       Impact factor: 4.562

3.  Genetic diversity and chloroquine selective sweeps in Plasmodium falciparum.

Authors:  John C Wootton; Xiaorong Feng; Michael T Ferdig; Roland A Cooper; Jianbing Mu; Dror I Baruch; Alan J Magill; Xin-Zhuan Su
Journal:  Nature       Date:  2002-07-18       Impact factor: 49.962

4.  Allele frequency-based and polymorphism-versus-divergence indices of balancing selection in a new filtered set of polymorphic genes in Plasmodium falciparum.

Authors:  Lynette Isabella Ochola; Kevin K A Tetteh; Lindsay B Stewart; Victor Riitho; Kevin Marsh; David J Conway
Journal:  Mol Biol Evol       Date:  2010-05-09       Impact factor: 16.240

5.  BamView: visualizing and interpretation of next-generation sequencing read alignments.

Authors:  Tim Carver; Simon R Harris; Thomas D Otto; Matthew Berriman; Julian Parkhill; Jacqueline A McQuillan
Journal:  Brief Bioinform       Date:  2012-01-16       Impact factor: 11.622

6.  Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.

Authors:  Helga Thorvaldsdóttir; James T Robinson; Jill P Mesirov
Journal:  Brief Bioinform       Date:  2012-04-19       Impact factor: 11.622

7.  Drug-resistant genotypes and multi-clonality in Plasmodium falciparum analysed by direct genome sequencing from peripheral blood of malaria patients.

Authors:  Timothy Robinson; Susana G Campino; Sarah Auburn; Samuel A Assefa; Spencer D Polley; Magnus Manske; Bronwyn MacInnis; Kirk A Rockett; Gareth L Maslen; Mandy Sanders; Michael A Quail; Peter L Chiodini; Dominic P Kwiatkowski; Taane G Clark; Colin J Sutherland
Journal:  PLoS One       Date:  2011-08-11       Impact factor: 3.240

8.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

9.  Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing.

Authors:  Magnus Manske; Olivo Miotto; Susana Campino; Sarah Auburn; Jacob Almagro-Garcia; Gareth Maslen; Jack O'Brien; Abdoulaye Djimde; Ogobara Doumbo; Issaka Zongo; Jean-Bosco Ouedraogo; Pascal Michon; Ivo Mueller; Peter Siba; Alexis Nzila; Steffen Borrmann; Steven M Kiara; Kevin Marsh; Hongying Jiang; Xin-Zhuan Su; Chanaki Amaratunga; Rick Fairhurst; Duong Socheat; Francois Nosten; Mallika Imwong; Nicholas J White; Mandy Sanders; Elisa Anastasi; Dan Alcock; Eleanor Drury; Samuel Oyola; Michael A Quail; Daniel J Turner; Valentin Ruano-Rubio; Dushyanth Jyothi; Lucas Amenga-Etego; Christina Hubbart; Anna Jeffreys; Kate Rowlands; Colin Sutherland; Cally Roper; Valentina Mangano; David Modiano; John C Tan; Michael T Ferdig; Alfred Amambua-Ngwa; David J Conway; Shannon Takala-Harrison; Christopher V Plowe; Julian C Rayner; Kirk A Rockett; Taane G Clark; Chris I Newbold; Matthew Berriman; Bronwyn MacInnis; Dominic P Kwiatkowski
Journal:  Nature       Date:  2012-07-19       Impact factor: 49.962

10.  Multiple origins and regional dispersal of resistant dhps in African Plasmodium falciparum malaria.

Authors:  Richard J Pearce; Hirva Pota; Marie-Solange B Evehe; El-Hadj Bâ; Ghyslain Mombo-Ngoma; Allen L Malisa; Rosalynn Ord; Walter Inojosa; Alexandre Matondo; Diadier A Diallo; Wilfred Mbacham; Ingrid V van den Broek; Todd D Swarthout; Asefaw Getachew; Seyoum Dejene; Martin P Grobusch; Fanta Njie; Samuel Dunyo; Margaret Kweku; Seth Owusu-Agyei; Daniel Chandramohan; Maryline Bonnet; Jean-Paul Guthmann; Sian Clarke; Karen I Barnes; Elizabeth Streat; Stark T Katokele; Petrina Uusiku; Chris O Agboghoroma; Olufunmilayo Y Elegba; Badara Cissé; Ishraga E A-Elbasit; Hayder A Giha; S Patrick Kachur; Caroline Lynch; John B Rwakimari; Pascalina Chanda; Moonga Hawela; Brian Sharp; Inbarani Naidoo; Cally Roper
Journal:  PLoS Med       Date:  2009-04-14       Impact factor: 11.069

